Log structured merge tree
A lot of notes are from notes by Sreekanth1
Terminology
- TPC and YCSB
- Transaction Processing Performance Council2 is the de facto benchmark for databases
- YCSB is created for benchmarking large scale serving systems, i.e. HBase, Cassandra where TPC is not accurate and overkill
- Transaction logs
- Used for durability, recovery from crash
- WAL, WBL
- TODO: LSM tree does not seems to be used with transaction logs
- BTrees
- often mean B+ Trees
- Blocks
- 'refers to the contents of one node in the B+ tree, A block in B+ tree contains the keys for guiding the search process'
- 'And number size of block has is related to how tall the tree is going to be.' this seems to be a typo
- 'It has nothing to do with the disk blocks(sectors)'
- Pages
- 'A smallest chunck of informaction db can work with'
- 'Data in a db are kept in a series of pages that are mappted to the disk(sectors)'
- 'A B tree will hold links to these pages'
- 'Multiple pages can be resident on a single disk block (not the same block above)'
- Multi-page Blocks
- ‘A Block that contains multiple pages and can be fit into a single read/write operations’
- 'In LSM tree a block contains as many entires as can fit a disk block'
- Leaf nodes
- ?'The pointer to the records on disk'?
- NO, I think they should be the data
- ?'The pointer to the records on disk'?
- Buffer (Manager)
- 'All operation on pages happen in meory (i.e. search on record id/match record fileds)'
- page in and out like OS does for virtual memory
- Records
- Merge
- 'called compaction in some papers'
- when one level reach its size threshold, it is merged to the next level, i.e. C_0 -> C_1, memory to disk
History
- 'Everyone uses B Tree'
- 'Big Table came out and cited LSM trees'
- 'NoSql boom! And everybody wanted to build LSM'
Users
- Bigtable
- Cassandra
- HBase
- PNUTS
- LevelDB/RocksDB
- SQLite (optional)
- MongoDB
- WiredTiger ?
Slides summary
will be updated after reading other paper and slides ... (it's a merge process as well)
Why
- all write goes to memory
- what if memory fails
- batched disk io
- only merge process leads to disk writes
- all disk IO is sequential
- always append only?
- the empty and filling block part? how to garbage collect old blocks
Cons
- degraded read performance
- need to scan all levels in general cases, i.e. for exact match and unique keys, just hit the memory is enough
- 'sequential search is merge of all components'
- TODO: isn't this duplicate with previous points
- Assumes clustered index
- 'Works well for KV stores'
- TODO: why KV, and I forgot what is clustered index
- 'Works well for KV stores'
- Merge prcoess
- 'Hard to get right' exactly
- 'Causes trouble with Recovery methods'
- The paper did mentioned recovery in section 4
Disk Model
- IOPS (Input/Output Operations per second) is important
- 'In place updates are always Random IO'
- 'Append only are generally Sequential IO'

Comparing random and sequential access in disk and memory3
- the data for the image is quite old, the source is a 2009 book, where SSD is not advanced as today
Database Index
Example: select * from employees where salary > 100
- binary search, very slow
BTrees
- 'Balanced Binary Search Trees are greate in memory'
- AVL/RB/2-3
- the paper suggested AVL
- 'But leads to too much IO on a disk'
Btree of order m
- 'has m children'
- 'all leaves appear at the same level'
- non leaf node with k children has k-1 keys
- orders are high, > 100
- shallow
- root in memory
- insert/deletes are bit more expensive
B+ Tree
- make sequential access faster
- 'double linked list for the leaf nodes'
- 'all keys reside in the leaves'
- 'inner nodes only act as guides'
LSM Tree
- 'Unlike Btree it's not a databstructure'
- Fast write
- TODO: blind write?
- 'Reasonable reads'
LSM Tree Ideas
- 'All writes go to memory, and are buffered there'
- 'All writes are done on a disk block'
- 'Use memory optimized datastructures for memory (Hashes/Skiplists/RB trees)'
'All writes are append only'
- multiple memory components as well (only really useful for concurrency)
C_0
- 'use memory efficient RB/2-3/AVL trees'
- 'No need to have disk block sized'
C_1
- Uses B+ Tree
- 100% full nodes
- Contiguous multi-page that matches disk blokcs
- that could be what the paper keep mentioning, multi-page
Merge
- asynchronously moves newly added items from C_0 to C_1 ... C_n
- only move when one level can not hold anymore
- 'idea is to fill the C_1 block 100% before writing to disk'
'complicate than merge sort because pages that need to move to buffer on read of C_1 blocks'
- TODO: typo? C_1 blocks need to be read to buffer before merge
merge sort: needs an extra array in memory
- C_1 leaf is made buffer resident -> emptying block
- C_0 smallest leaves are merged
- into new C_1 leaf node -> filling block
- TODO: block or disk block
- write to a new disk area
- TODO: continue till emptying block is empty
- TODO: NOTE: there is no FIFO(age related stuff here)
- TODO: sounds like CRDT
Find
skip
Delete
- Exact Match
- Put a tombstone for the key
- Merge process finds the acutal key and annhilates it
- Predicate delete
- i.e.
DELETE FROM user WHERE age < 18 - Put a tombstone for a key matching a filter
- TODO: put tombstone on key or make it a filter for future merge process
- TODO: 'merge allows use to do some fancy things, like find-note(high latency full text search for example)'
- i.e.
Cost comparisions with BTree
- Multipage write
- Batching efficiency
- TODO: append only?
Concurrency
- 'Merge causes additional issues over C_0'
- 'The general approach is to skip these locked entries when merging'
Recovery
- every insert goes to the transaction log
- TODO: is index really part of database? like when crash recovery, the index is also recovered
- check pointing
- TODO: is check point mentioned in the paper?
- Complete all merge operations
- Write C_0 to disk (Stop the world)
- TODO: leveldb have stall problem
- Write the chekpoint log
- Merge cursors
- Component roots
Optimization
Faster read
- TODO: Differential Index
- Bloom filters
- TODO: Fractional Cascading
Faster Merges/Compactions
- TODO: Lock free
- Levels
Cassandra
The idea of Cassandra really explain this well
Cassandra writes are first written to a commit log (for durability), and then to an in-memory table structure called a memtable. A write is successful once it is written to the commit log and memory, so there is very minimal disk I/O at the time of write. Writes are batched in memory and periodically written to disk to a persistent table structure called an SSTable (sorted string table).
- SSTables are backed by Bloomfilter (bLSM paper)
- TODO: bloom filter
- https://wiki.apache.org/cassandra/MemtableSSTable
- https://wiki.apache.org/cassandra/ArchitectureSSTable
http://stackoverflow.com/questions/15857779/commitlog-and-sstables-in-cassandra-database
LevelDB
- TODO: Fixed size SSTables
- 'Each level is guaranteed to have unique keys'
- TODO: 'Each level is 10x size previous'
- TODO: 'SSTables are promoted to higher levels'
- TODO: 'Memtables are implemented by a skiplist'
References
1. Notes for reading the LSTM tree paper ↩
2. TPC Wiki ↩
3. The Pathologies of Big Data ↩