Preamble – I was still in high school when the fascinating world of data structures first amazed me: for each problem, there is a particular structure that would fit the best, heap, trees, colored trees, oh my?! So, my interest in these structures stayed through the college years too. When working at Compaq as an Oracle administrator for National Healthcare, I did put many of my learnings (and teachings) to real try – e.g. I did have to run and optimize queries accessing sometimes record counts up to a billion, and the endusers expected to have a timely answer. That version of Oracle had some level of manual optimization you could play around with, so I ended up in long nightly calls with Oracle product groups to help me finetune some queries to conjure the results up in (sometimes) a week.
In today’s digital world, databases have become an essential part of modern computing. They are used to store and manage vast amounts of data and are an integral part of many software applications. Over time, various structures have been developed to optimize the performance of databases. In this post, we will take a closer look at some of the key structures driving modern databases.
- Skip List
The skip list is a probabilistic data structure used to implement an ordered set or map. It is essentially a linked list with additional pointers that “skip” some elements, providing a faster search time. Skip lists are useful for maintaining a sorted index in memory, and are commonly used in high-performance databases and search engines.
- Hash Index
A hash index is a data structure that maps keys to values using a hash function. The hash function takes the key as input and returns a unique value that represents the location of the value in the database. Hash indexes are fast for lookups, but not well suited for range queries or sorting.
SSTable stands for “Sorted String Table.” It is a file format used to store data in a sorted order. SSTables are immutable and append-only, which means that once written, they cannot be modified. This makes them very efficient for read operations, as data can be read sequentially without the need for complex index structures.
- LSM Tree
LSM stands for “Log-Structured Merge.” The LSM tree is a data structure that uses a combination of in-memory and on-disk structures to store data. New data is first stored in an in-memory data structure called a memtable. Once the memtable becomes too large, it is flushed to disk in an SSTable format. Over time, multiple SSTables are merged together to form a single larger SSTable. The LSM tree is very efficient for write-intensive workloads, as it minimizes disk I/O operations and can handle large write volumes.
The B-tree is a balanced tree data structure that is commonly used in databases to store and retrieve data. B-trees are optimized for disk-based storage and are designed to minimize disk I/O operations. They work by splitting nodes when they become too full, allowing for fast insertions and deletions while maintaining a balanced tree structure.
- Inverted Index
An inverted index is a data structure used to index text data, such as in a search engine. It works by creating a mapping of each unique word in a document to the documents that contain that word. This allows for fast full-text searches and is commonly used in search engines and document management systems.
- Suffix Tree
A suffix tree is a data structure used to store and index strings. It works by creating a tree structure that represents all possible suffixes of a string. Suffix trees are useful for text processing and are commonly used in natural language processing and bioinformatics.
An R-tree is a spatial index data structure used to index points or rectangles in space. It works by dividing space into smaller rectangles and indexing them based on their position. R-trees are useful for geographic information systems, image processing, and other applications that deal with spatial data.
- Bloom Filter
A Bloom filter is a probabilistic data structure used to test whether an element is a member of a set. It works by hashing each element and setting corresponding bits in a bit array. Bloom filters are space-efficient and provide fast lookups but may produce false positives.
- Cuckoo Hashing
Cuckoo hashing is a hash table algorithm that resolves collisions by rehashing the key to a different hash table. It works by using two hash tables and can provide very fast insertions and lookups.
- Fractal Tree
A fractal tree is a self-similar data structure that is optimized for large-scale data storage and retrieval. It is designed to provide fast insertions, deletions, and lookups, and can handle data sets that are too large to fit into memory.
- Bitmap(ped) Index
A bitmapped index is a data structure used to index data that can be represented as bitmaps (not bitmap as a picture – bitmap as a map of bits). It works by mapping each possible value of a column to a bit in a bitmap, and then performing bitwise operations to filter data.
A trie, also known as a prefix tree, is a data structure used to store and search for strings. It works by storing each character of a string in a tree structure, with each path from the root to a leaf representing a unique string. Back around 2003, we managed to fine tune a trie based algorithm to beat Microsoft’s pessimistic implementation in .NET 1.1 nearly 100 fold 🙂
HyperLogLog is a probabilistic data structure used to estimate the cardinality of a set. It works by using a hash function to map elements to a large number of “buckets,” and then counting the number of unique buckets. HyperLogLog provides a space-efficient way to estimate the size of large data sets.
These are just a few examples of modern structures that are used to optimize the performance of databases. As data sets continue to grow and evolve, new structures will likely be developed to meet the needs of modern computing – eg we haven’t touched the structures needed for quantum computing yet 🙂 In conclusion, modern databases rely on a variety of key structures to optimize performance and efficiently store and retrieve data. From skip lists and hash indexes to B-trees and inverted indexes, each data structure has its strengths and weaknesses, and choosing the right one for a particular application requires careful consideration of the specific use case.