Build Your Own Database
A step-by-step guide to building a key-value database from scratch.
If you were to build your own database today, not knowing that databases exist already, how would you do it? In this post, we'll explore how to build a key-value database from the ground up.
A key-value database works more or less like objects in JavaScript—you can store values using a key and retrieve them later using that same key:
$ db set 'hello' 'world'
$ db get 'hello'
world
Let's find out how they work!
The Humble File
Databases were made to solve one problem:
Problem
How do we store data persistently and then efficiently look it up later?
The typical way to store any kind of data persistently in a computer is to use a file. When we want to store data, we add the key-value pair to the file:
$ db set 1 "Lorem ipsum"
$ db set 18 "dolor sit"

db.txt
- 001: Lorem ipsum
- 018: dolor sit
When we want to look for a specific key, we iterate through the pairs to see if there's a matching key.
For updates, we'll find the key and replace the value in-place.
And for deletes, we'll remove the record from the file.
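To make this concrete, here's a minimal sketch of what this first version could look like in Node.js. The db.txt file name, the key:value line format, and the set/get/del helpers are all illustrative assumptions, not a reference implementation:

import fs from "fs";

const FILE = "db.txt";

// Read the whole file and parse each "key:value" line into a [key, value] pair.
function readAll() {
  if (!fs.existsSync(FILE)) return [];
  return fs
    .readFileSync(FILE, "utf8")
    .split("\n")
    .filter(Boolean)
    .map((line) => {
      const [key, ...rest] = line.split(":");
      return [key, rest.join(":")];
    });
}

// Rewrite every record back to disk.
function writeAll(records) {
  const text = records.map(([key, value]) => `${key}:${value}`).join("\n");
  fs.writeFileSync(FILE, text + "\n");
}

function set(key, value) {
  const records = readAll();
  const existing = records.find(([k]) => k === key);
  if (existing) existing[1] = value; // update in-place
  else records.push([key, value]);
  writeAll(records);
}

function get(key) {
  // Linear scan: check every record until we find a matching key.
  const match = readAll().find(([k]) => k === key);
  return match ? match[1] : undefined;
}

function del(key) {
  writeAll(readAll().filter(([k]) => k !== key));
}

set("001", "Lorem ipsum");
set("018", "dolor sit");
console.log(get("001")); // "Lorem ipsum"
del("018");

Note how set, del, and even a simple update end up rewriting the file. That detail is exactly what the next section digs into.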
Easy! We're done, right?
Mutable Updates
This approach, simple as it is, doesn't actually work very well in practice. The problem lies with the way we're doing updates and deletes—they're wholly inefficient.
To a computer, our file looks something like this—nothing more than a long sequence of bytes:
001:Lorem␣ipsum\n018:dolor␣sit\n005:adipiscing␣elit.\n014:Vestibulum␣varius\n002:vel␣mauris\n007:consectetur␣adipiscing␣elit.\n010:Vestibulum␣varius\n016:vel␣mauris\n003:consectetur␣adipiscing␣elit.
When we go to update or delete a record, we're currently updating that record in-place, which means we potentially have to move all of the data that comes after that record:
001:Lorem␣ipsum\n018:dolor␣sit\n005:adipiscing␣elit.␣vel␣mauris\n014:Vestibulum␣varius\n002:vel␣mauris\n007:consectetur␣adipiscing␣elit.\n010:Vestibulum␣varius\n016:vel␣mauris\n003:consectetur␣adipiscing␣elit.
In this case, updating the record 005 to "adipiscing␣elit.␣vel␣mauris" means moving all of the records that come after it by 11 bytes (the length of the added string "␣vel␣mauris"). This can quickly get really costly, especially as our database grows in size!
Append-Only Files
One way to work around the update problem is to make records immutable. In other words, we add the constraint that we can only add new records to the end of the file and never update or delete existing ones.
With this approach, updates are treated the same as inserts—just add a new record to the end of the file.
But now we have another problem—there are duplicate keys in the file!
To work around this, we have to change our search algorithm to look for the last occurrence of the key instead of the first.
To delete records, we create a special "tombstone" record that marks the key as deleted. There's no single way to do this, but one way is to use a special value like null.
And there we have it! We have a key-value database that uses a file as its storage mechanism. Using it, we can store, find, update, and delete key-value pairs.
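As a rough sketch (using the same hypothetical db.txt format as before), the append-only version only ever appends, and reads keep the last value they see for a key:

import fs from "fs";

const FILE = "db.txt";

// Writes only ever append; existing records are never touched.
function set(key, value) {
  fs.appendFileSync(FILE, `${key}:${value}\n`);
}

// Deletes append a tombstone record with a special value.
function del(key) {
  fs.appendFileSync(FILE, `${key}:null\n`);
}

// Reads scan the whole file and keep the *last* value seen for the key,
// since later records supersede earlier ones.
function get(key) {
  if (!fs.existsSync(FILE)) return undefined;
  const lines = fs.readFileSync(FILE, "utf8").split("\n").filter(Boolean);
  let result;
  for (const line of lines) {
    const [k, ...rest] = line.split(":");
    if (k === key) result = rest.join(":");
  }
  return result === "null" ? undefined : result; // tombstone means "deleted"
}

set("001", "Lorem ipsum");
set("001", "Lorem ipsum v2"); // an update is just another appended record
set("018", "dolor sit");
del("018");
console.log(get("001")); // "Lorem ipsum v2"
console.log(get("018")); // undefined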
Now this implementation isn't perfect; right now, there are two major issues:
- The file can get very large. Since we're only appending to the file, the file will grow infinitely over time. Not good!
- Searching is slow. To search for a specific key, we have to potentially iterate through all records in the database. For a database with millions of records, this can take a while!
How can we fix these problems?
Keeping Files Small
Problem
How do we make sure the file doesn't grow indefinitely? Because we're using an append-only file, we need some mechanism to periodically "shrink" the file so it doesn't eventually take over our entire hard drive.
Take a look at our database here after a few updates and deletes:
$ db set 1 "Lorem ipsum"
$ db set 18 "dolor sit"
$ db set 7 "adipiscing elit."
$ db delete 7
$ db set 10 "consectetur adipiscing elit."
$ db delete 1

db.txt
- 001: Lorem ipsum
- 018: dolor sit
- 007: adipiscing elit.
- 007: null
- 010: consectetur adipiscing elit.
- 001: null
Our database file has six entries, but only two represent actual records—the rest are either deleted or contain stale data. If we can clear all the irrelevant data, we can shrink the file by over 66%!
Segments and Compaction
Here's an idea: once a file exceeds a certain size, we'll close the file (meaning we stop adding new data to it) and create a new one. While the new file ingests new data (in the same way we've been doing so far), we'll compact the old file by deleting all of its irrelevant data.
Here, we've set the maximum file size to seven records. Notice that the database is full—adding one more record will force us to close this file and start a new one:
- 001: Lorem ipsum
- 018: dolor sit
- 007: adipiscing elit.
- 007: null
- 010: consectetur adipiscing elit.
- 001: null
- 020: Vestibulum varius
Now, our database consists of two different files which we'll call segments. Each segment will usually become a lot smaller after compaction, which means we can merge them together as part of the compaction process.
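Here's one way the compaction-and-merge step could be sketched, again assuming key:value lines with null as the tombstone; the segment file names are made up for the example:

import fs from "fs";

// Compact one or more segment files into a single new segment: keep only the
// latest value for each key and drop tombstones. Pass segments oldest-first
// so that newer values win.
function compact(segmentPaths, outputPath) {
  const latest = new Map();

  for (const path of segmentPaths) {
    if (!fs.existsSync(path)) continue;
    const lines = fs.readFileSync(path, "utf8").split("\n").filter(Boolean);
    for (const line of lines) {
      const [key, ...rest] = line.split(":");
      latest.set(key, rest.join(":")); // later lines override earlier ones
    }
  }

  const output = [];
  for (const [key, value] of latest) {
    if (value !== "null") output.push(`${key}:${value}`); // drop deleted keys
  }
  fs.writeFileSync(outputPath, output.join("\n") + "\n");
}

// Merging two old segments into one compacted segment.
compact(["segment-1.txt", "segment-2.txt"], "segment-merged.txt");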
With that, we've made a mechanism to stop our database from growing indefinitely!
Your First Index
Our next problem is search performance:
Problem
How do we make searching fast? Right now, we have to iterate through all of the records in the database to find a specific key. This is super slow!
What if we use objects? That's right, these little guys:
const hashTable = {};
JavaScript objects, otherwise known as hash tables or dictionaries, are really efficient at storing and looking up key-value pairs:
const hashTable = {
  hello: "world",
  foo: "bar",
  baz: "qux",
};

const value = hashTable["hello"]; // "world"
It doesn't matter how many records there are—the time it takes to look up and retrieve a value in a hash table is more or less constant. The catch is they must live in memory.
Aside: In-Memory vs. On-Disk
When you write a variable in your code, the computer will "remember" the value of that variable only for as long as the program is running.
let x = 1;
x = x + 1;
console.log(x); // 2
x = x + 1;
console.log(x); // 3
This program will always print 2 and 3 because the value of x "resets" every time we run the program. This is because x is stored in-memory, and any value stored in memory is discarded when the program stops.
If we want our data to "stick" between runs, we'll need to store it on-disk—in other words, a file.
import fs from "fs";

// Read the previous value from disk, defaulting to 1 on the first run.
let x = fs.existsSync("data.txt")
  ? Number(fs.readFileSync("data.txt", "utf8")) || 1
  : 1;

x = x + 1;
console.log(x);
x = x + 1;
console.log(x);

// Persist the latest value so the next run picks up where we left off.
fs.writeFileSync("data.txt", String(x));
This time, x will print 2 and 3 the first run, and 4 and 5 the second run.
The tradeoff to persistence is performance—accessing data from memory is about 80x faster on average than accessing it from disk.
Here's how the index will work. For every record that we have in our database, we'll store that record's offset—the number of bytes from the beginning of the file to the start of the record—in the index:
file.txt
1:Lorem␣ipsum\n

Index
- 001: 0

Database
- 001: Lorem ipsum
The second record, 18: dolor sit, for example, has an offset of 15 because:
- Each character is 1 byte large;
- The first record is 13 characters long (1:Lorem ipsum);
- The first record ends with a newline character, which is (at most) 2 bytes long.
This gives us an offset of 13 + 2 = 15.
One thing to note is that we need an index for each segment because the offset is relative to the start of the file—in other words, the start of each segment.
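Here's a sketch of how a single segment's index could be maintained alongside writes. The db.txt file name, the Map-based index, and the 256-byte read buffer are all assumptions for illustration:

import fs from "fs";

const FILE = "db.txt";
const index = new Map(); // key -> byte offset of the record in FILE
let offset = fs.existsSync(FILE) ? fs.statSync(FILE).size : 0;

function set(key, value) {
  const record = `${key}:${value}\n`;
  fs.appendFileSync(FILE, record);
  index.set(key, offset); // remember where this record starts...
  offset += Buffer.byteLength(record); // ...and where the next one will start
}

function get(key) {
  if (!index.has(key)) return undefined;
  // Jump straight to the record instead of scanning the whole file.
  const fd = fs.openSync(FILE, "r");
  const buffer = Buffer.alloc(256); // assume records are shorter than 256 bytes
  const bytesRead = fs.readSync(fd, buffer, 0, buffer.length, index.get(key));
  fs.closeSync(fd);
  const line = buffer.toString("utf8", 0, bytesRead).split("\n")[0];
  return line.split(":").slice(1).join(":");
}

set("001", "Lorem ipsum");
set("018", "dolor sit");
console.log(get("018")); // "dolor sit"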
Searching With Indices
Using an index, our search algorithm can now run a lot more efficiently:
1. Starting at the most recent segment, look up the key in the index;
2. If the key is found, read the record at the offset;
3. If the key is not found, move on to the next segment;
4. Repeat (2) and (3) until the key is found or all segments have been searched.
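In code, that search could look something like the sketch below; the segment objects, file names, and offsets are hypothetical:

import fs from "fs";

// Read a single "key:value" record starting at the given byte offset.
function readRecordAt(path, offset) {
  const fd = fs.openSync(path, "r");
  const buffer = Buffer.alloc(256); // assume records fit in 256 bytes
  const bytesRead = fs.readSync(fd, buffer, 0, buffer.length, offset);
  fs.closeSync(fd);
  const line = buffer.toString("utf8", 0, bytesRead).split("\n")[0];
  return line.split(":").slice(1).join(":");
}

// Segments are ordered from most recent to oldest.
function search(segments, key) {
  for (const segment of segments) {
    if (segment.index.has(key)) {
      // Found in this segment's index: read the record at its offset.
      return readRecordAt(segment.path, segment.index.get(key));
    }
    // Not in this segment: fall through to the next, older one.
  }
  return undefined; // every segment searched, the key doesn't exist
}

// Hypothetical segments and offsets, newest first.
const segments = [
  { path: "segment-2.txt", index: new Map([["020", 0]]) },
  { path: "segment-1.txt", index: new Map([["001", 0], ["018", 16]]) },
];
console.log(search(segments, "018"));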
Updating Indices
An index is only useful if it's in sync with our data. Whenever we update, delete, or insert a record, we have to change the index accordingly:
Database
- 001: Lorem ipsum
- 018: dolor sit

Index
- 001: 0
- 018: 15
Notice what this implies—writing to the database is slower with an index! This is one of the tradeoffs of using an index; we can search for data much faster at the cost of slower writes.
Tradeoffs
An index is great because it lets us query our database much faster, but there are some problems with our specific hash table implementation:
- Keys have to fit in memory. Since we're using an in-memory hash table as our index, all of the keys in our database must fit in memory. This means there's a limit on the number of keys we can store!
- Range queries are inefficient. Our index wouldn't help for such queries; if we wanted to find all the records between the keys 12 and 18, for example, we'd have to iterate through the entire database!
Sorted String Tables
Here's an idea: what if we ensure our database is always sorted by key? By sorting our data, we can immediately make range queries fast:
Unsorted
Find all values > 2 and < 6: 7, 3, 9, 1, 4, 8, 11
We have to check every value ("7 is out of bounds, skipping...", and so on) because a match could be anywhere in the list.

Sorted
Find all values > 2 and < 6: 1, 3, 4, 7, 8, 9, 11
We can skip past everything below the lower bound ("1 is below lower bound, skipping...") and stop as soon as we pass the upper bound.
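To see why sorted data helps, here's a small sketch of a range query over a sorted array: binary search finds the first value past the lower bound, and we stop as soon as we pass the upper bound:

// Find the index of the first value greater than `lower` using binary search.
function firstAbove(sorted, lower) {
  let lo = 0;
  let hi = sorted.length;
  while (lo < hi) {
    const mid = (lo + hi) >> 1;
    if (sorted[mid] > lower) hi = mid;
    else lo = mid + 1;
  }
  return lo;
}

// Collect all values strictly between `lower` and `upper`.
function range(sorted, lower, upper) {
  const results = [];
  // Jump straight past everything <= lower...
  for (let i = firstAbove(sorted, lower); i < sorted.length; i++) {
    if (sorted[i] >= upper) break; // ...and stop once we pass the upper bound.
    results.push(sorted[i]);
  }
  return results;
}

console.log(range([1, 3, 4, 7, 8, 9, 11], 2, 6)); // [3, 4]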
Sparse Indices
One benefit of sorting our data is that we no longer need to store the offset of every record in memory.
Take a look at this database with four records. Since there's no logical order to the records, there's no way to determine where a record is without either storing its offset in the index or searching through the entire database.
Database
- 001: Lorem ipsum
- 007: amet, consectetur
- 010: adipiscing elit.
- 018: dolor sit

Index
- 001: 0
- 007: 15
- 010: 36
- 018: 57
Knowing that 10 has an offset of 36 doesn't help us find where 18 is.
Now if these records were sorted, we could determine the location of each record using any of the keys in the index, even if it's not the key we're looking for.
Let's say our database is sorted but we only had the offset for the key 10:
Database
- 001: Lorem ipsum
- 007: amet, consectetur
- 010: adipiscing elit.
- 018: dolor sit

Index
- 010: 36
Let's say we want to find the key 18. We know that 18 is greater than 10, which means it must be after 10 in the database. In other words, we can start searching for 18 from 10's offset—36 in this case.
While this is certainly slower than having the offset for 18 directly, it's still faster than looping through the database in its entirety.
The real unlock here lies in being able to control the trade-off between memory and performance: a denser index means faster lookups, but more memory usage.
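Here's a sketch of that lookup, assuming the sparse index is a sorted array of [key, offset] pairs and the segment is a sorted file of key:value lines; the file name and offset are illustrative:

import fs from "fs";

// Find the closest indexed key at or before the one we're looking for.
// Keys are zero-padded strings, so lexicographic comparison matches numeric order.
function floorEntry(sparseIndex, key) {
  let best = null;
  for (const [indexedKey, offset] of sparseIndex) {
    if (indexedKey <= key) best = [indexedKey, offset];
    else break; // the index is sorted, so we can stop early
  }
  return best;
}

// Look up a key in a sorted segment using a sparse index.
function get(segmentPath, sparseIndex, key) {
  const entry = floorEntry(sparseIndex, key);
  if (!entry) return undefined; // key is smaller than everything we indexed

  // Start scanning from the indexed offset instead of the start of the file.
  // (For simplicity we read the whole file and slice; a real implementation
  // would seek directly to the offset.)
  const text = fs.readFileSync(segmentPath, "utf8");
  const lines = text.slice(entry[1]).split("\n").filter(Boolean);
  for (const line of lines) {
    const [k, ...rest] = line.split(":");
    if (k === key) return rest.join(":");
    if (k > key) return undefined; // sorted, so we've gone past it
  }
  return undefined;
}

// Only key 010's offset is in the index; 018 is found by scanning from there.
const sparseIndex = [["010", 36]];
console.log(get("segment-1.txt", sparseIndex, "018"));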
Sorting in Practice
Ensuring our database is always sorted is much easier said than done; by definition, sorting data requires moving around records as new ones get added—something that cannot be done efficiently when we're storing data on-disk. This brings us to our problem:
Problem
How do we keep our data sorted and append-only? It's too slow to sort the data on-disk every time we add a new record; is there another way?
The trick is to first sort the data in memory, and then write it to disk.
- When we add a new record, add it to a sorted in-memory list;
- When our in-memory list gets too large, we'll write it to disk;
- When we want to read a record, we'll read the in-memory list first, and then the disk if necessary.
Memory
- 010: Lorem ipsum

On-Disk
(empty)
The data structure used to store the in-memory list is usually one optimized for sorted data, like a balanced binary search tree or, more commonly, a skip list.
Of course, the main downside of having some of your data in-memory is that it's not persistent—if the program crashes or the computer shuts down, all of the data in the in-memory list is lost.
The fix here is thankfully pretty straightforward—every time we add a record to the list, we also write it to an append-only file on disk. This way, we have a backup in case a crash does happen (which it most certainly will).
Memory
- 010: Lorem ipsum

On-Disk Database
(empty)

On-Disk Log
- 010: Lorem ipsum
The append-only file doesn't need to be sorted nor does it need to have every record in the database; only the ones that are currently in memory.
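Here's a sketch of that write path, with a plain Map standing in for the sorted in-memory list and a hypothetical wal.txt as the append-only log:

import fs from "fs";

const WAL = "wal.txt";      // append-only crash-recovery log
const memtable = new Map(); // stand-in for the sorted in-memory list

function set(key, value) {
  // 1. Persist the write to the log first, in case we crash.
  fs.appendFileSync(WAL, `${key}:${value}\n`);
  // 2. Then apply it to the in-memory table.
  memtable.set(key, value);
}

// On startup, replay the log to rebuild the in-memory table after a crash.
function recover() {
  if (!fs.existsSync(WAL)) return;
  const lines = fs.readFileSync(WAL, "utf8").split("\n").filter(Boolean);
  for (const line of lines) {
    const [key, ...rest] = line.split(":");
    memtable.set(key, rest.join(":"));
  }
}

recover();
set("010", "Lorem ipsum");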
With that, we have our very own key-value database! Let's recap how it works.
Our database starts out empty. When we go to add a new record, we'll add it to a sorted in-memory list, keeping a copy in an append-only file in case of crashes.
Memory
- 010: Lorem ipsum

On-Disk Log
- 010: Lorem ipsum
When the in-memory list gets too large, we'll flush the list by writing all of the records to a file in sorted order. In the process, we'll keep note of each record's offset in an index so we can efficiently look them up later.
Memory
- 004: dolor sit
- 010: Lorem ipsum
- 012: consectetur elit.

On-Disk Log
- 010: Lorem ipsum
- 004: dolor sit
- 012: consectetur elit.

Index

On-Disk Database
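Flushing could be sketched like this: write the in-memory records to a segment in sorted order while recording each record's byte offset. The Map memtable and segment file name are assumptions carried over from the earlier sketches:

import fs from "fs";

// Write the contents of an in-memory table to disk as a sorted segment,
// returning an index of each record's byte offset.
function flush(memtable, segmentPath) {
  const index = new Map();
  let text = "";
  let offset = 0;

  // Sort keys so the on-disk segment is ordered.
  const keys = [...memtable.keys()].sort();
  for (const key of keys) {
    const record = `${key}:${memtable.get(key)}\n`;
    index.set(key, offset);
    text += record;
    offset += Buffer.byteLength(record);
  }

  fs.writeFileSync(segmentPath, text);
  memtable.clear(); // memory is free to accept new writes
  return index;
}

const memtable = new Map([
  ["010", "Lorem ipsum"],
  ["004", "dolor sit"],
  ["012", "consectetur elit."],
]);
const index = flush(memtable, "segment-1.txt");
console.log(index); // Map { '004' => 0, '010' => 14, '012' => 30 }

In a fuller implementation, we'd also truncate the write-ahead log at this point, since everything it was protecting is now safely on disk.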
When we want to look up a record, we'll first check the in-memory list. If the record isn't there, we'll check the index to see if it's in the on-disk file.
Memory
- 011: amet, consectetur
- 020: Vestibulum varius

Index
- 004: 16
- 010: 0
- 012: 29

On-Disk Database
- 010: Lorem ipsum
- 004: dolor sit
- 012: consectetur elit.
Once a file is saved to disk, it's considered immutable, which means we can only ever read from the file and never update it. To work around this, we'll treat updates and deletes the same as inserting new records—add them to the in-memory list.
Memory
- 004: dolor sit
- 010: Lorem ipsum
- 012: consectetur elit.
Treating updates and deletes as new records means our file will only ever grow larger. To prevent this, we'll occasionally compact the on-disk files by deleting all duplicate records.
- 001: Lorem ipsum
- 018: dolor sit
- 007: adipiscing elit.
- 007: null
- 010: consectetur adipiscing elit.
- 001: null
- 020: Vestibulum varius
LSM Trees
What we just built actually exists in the real world—it's called an LSM tree, or Log-Structured Merge Tree.
An LSM tree works by combining an in-memory list (often called a memtable) with an on-disk file (typically called a sorted string table or SST) to create a really fast key-value database.
LSM trees are the underlying data structure used for large-scale key-value databases like Google's LevelDB and Amazon's DynamoDB, and they have proven to perform really well at scale—on Prime Day 2020, DynamoDB peaked at 80 million requests per second!
Now, LSM trees aren't perfect, and they're certainly not the only way to structure a database. In fact, relational databases like PostgreSQL or MySQL use a completely different structure called a B-Tree to store their data—but that's a deep dive for another post.
For now, thanks for reading!