Relational and key-value databases rely on a small set of data structures and algorithms to store, retrieve, and protect data. Understanding these internals is necessary for effective schema design, index selection, and performance tuning. This document covers B-trees, LSM trees, write-ahead logging, MVCC, the buffer pool, and various index strategies, with specific reference to PostgreSQL, MySQL/InnoDB, RocksDB, and SQLite.
B-Trees
B-trees are the primary index structure in PostgreSQL, MySQL/InnoDB, SQL Server, Oracle, and SQLite. A B-tree is a self-balancing tree in which each node holds multiple keys and may have multiple children. The high fanout of each node minimizes the number of disk reads required to locate any given key.
A simplified B-tree in which each node holds at most 3 keys:
                          [30 | 70]
                         /    |    \
                        /     |     \
             [10 | 20]   [40 | 50 | 60]   [80 | 90]
             /  |  \      /  |   |  \      /  |  \
[1-9] [11-19] [21-29] [31-39][41-49][51-59][61-69] [71-79][81-89][91-99]
  ^                                                                  ^
  |                                                                  |
  +------------------- Leaf nodes linked together -------------------+
                      (enables efficient range scans)
Fanout and Tree Height
PostgreSQL uses an 8 KB page size. With 8-byte index keys, 6-byte page pointers, and per-entry overhead, several hundred keys fit per node. Assuming a fanout of 500, a B-tree requires the following number of levels to index a given row count:
| Levels | Maximum Rows Indexed |
|---|---|
| 1 | 500 |
| 2 | 250,000 |
| 3 | 125,000,000 |
| 4 | 62,500,000,000 |
Four disk reads are sufficient to locate any row among 62.5 billion entries. In practice, the root and upper internal nodes remain cached in the buffer pool, reducing actual I/O to 1 or 2 reads for most lookups.
MySQL/InnoDB uses a 16 KB page size, yielding even higher fanout. SQLite defaults to 4 KB pages but supports configurable sizes up to 64 KB.
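The level arithmetic in the table can be checked with a short sketch (the fanout of 500 is the illustrative value from above, not a measured PostgreSQL figure):

```python
def btree_levels(row_count: int, fanout: int) -> int:
    """Smallest number of levels such that fanout**levels >= row_count."""
    levels, capacity = 0, 1
    while capacity < row_count:
        capacity *= fanout
        levels += 1
    return levels

# Reproduces the table for fanout 500:
# 1 level -> 500 rows, 2 -> 250,000, 3 -> 125,000,000, 4 -> 62,500,000,000
```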
Node Splits
When an insert targets a full node, the node splits. The median key is promoted to the parent node. If the parent is also full, the split cascades upward, potentially creating a new root.
Before inserting 25 into a full leaf node:
Parent:  [20 | 40]
              |
Leaf:    [21 | 23 | 24 | 26]   <-- full (max 4 keys)
Step 1: Insert 25, node overflows:
Leaf: [21 | 23 | 24 | 25 | 26] <-- exceeds capacity
Step 2: Split at median (24), promote median to parent:
Parent:  [20 | 24 | 40]
              /    \
      [21 | 23]    [25 | 26]
Step 3: If the parent also overflows, it splits upward:
New Root:       [24]
               /    \
           [20]      [40]
             |         |
      [21 | 23]    [25 | 26]
Cascading splits have implications for write patterns. Random primary keys such as UUIDv4 distribute inserts across all leaf pages, causing widespread page dirtying and poor buffer pool utilization. Sequential keys (auto-increment integers, ULIDs, UUIDv7) concentrate inserts at the rightmost leaf, keeping the working set small.
In InnoDB, where the primary key is the clustered index, a random UUID primary key fragments the entire table's physical layout. Replacing UUIDv4 with UUIDv7 (timestamp-prefixed) in a table with several million rows can reduce insert latency by 40-50% and cut page fragmentation significantly.
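The median split in the walkthrough above reduces to a plain list operation. This is a toy model of a leaf node, not InnoDB's or PostgreSQL's actual page code:

```python
def split_leaf(keys: list) -> tuple:
    """Split an overflowing sorted leaf at its median.
    Returns (left_keys, promoted_key, right_keys); the promoted
    key moves up into the parent node."""
    mid = len(keys) // 2
    return keys[:mid], keys[mid], keys[mid + 1:]

# The overflow from the example:
left, promoted, right = split_leaf([21, 23, 24, 25, 26])
# left = [21, 23], promoted = 24, right = [25, 26]
```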
Page Layout
A "node" in a B-tree corresponds to a page (or block) on disk. This is the fundamental I/O unit.
PostgreSQL 8 KB Page Layout:
+------------------------------------------------------+
| Page Header (24 bytes) |
| - LSN (Log Sequence Number for WAL recovery) |
| - Checksum |
| - Flags, free space pointers |
+------------------------------------------------------+
| Item Pointers (array of offsets, grows downward) |
| - ItemId 1 -> offset 8140 |
| - ItemId 2 -> offset 8020 |
| - ItemId 3 -> offset 7900 |
| - ... |
+------------------------------------------------------+
| |
| Free Space |
| |
+------------------------------------------------------+
| - Tuple 3 (row data at offset 7900) |
| - Tuple 2 (row data at offset 8020) |
| - Tuple 1 (row data at offset 8140) |
| (tuple area grows upward from bottom) |
+------------------------------------------------------+
| Special Space (index-specific metadata) |
+------------------------------------------------------+
Item pointers grow downward from the top of the page. Tuple data grows upward from the bottom. They meet in the middle. When free space is exhausted, the page is full.
In PostgreSQL, if an updated row no longer fits on its original page, it is written to a different page with a forwarding pointer left behind. This contributes to heap bloat, which VACUUM addresses (discussed in the MVCC section).
Indexes
Index-Only Scans and Covering Indexes
If an index contains every column referenced by a query, the database can answer the query without visiting the heap. This is an index-only scan.
CREATE INDEX idx_orders_date_total ON orders(created_at, total_amount);
-- Index-only scan: all referenced columns are in the index
SELECT created_at, total_amount FROM orders
WHERE created_at BETWEEN '2023-01-01' AND '2023-06-30';
-- Requires heap access: customer_name is not in the index
SELECT created_at, total_amount, customer_name FROM orders
WHERE created_at BETWEEN '2023-01-01' AND '2023-06-30';

In PostgreSQL, index-only scans still consult the visibility map to confirm tuple visibility under MVCC. If the visibility map is stale (because VACUUM has not run recently), PostgreSQL falls back to heap fetches even when an index-only scan would otherwise suffice.
Using EXPLAIN ANALYZE
EXPLAIN shows the planner's estimated plan. EXPLAIN ANALYZE executes the query and reports actual timings and row counts. The difference between estimated and actual values reveals stale statistics, cardinality misestimates, and suboptimal plan choices.
EXPLAIN ANALYZE SELECT * FROM orders
WHERE created_at BETWEEN '2023-01-01' AND '2023-06-30'
ORDER BY created_at;
-- Expected output (index scan):
-- Index Scan using idx_orders_created_at on orders
-- (cost=0.43..15234.12 rows=45123 width=85)
-- (actual time=0.031..42.156 rows=44892 loops=1)
-- Index Cond: (created_at >= '2023-01-01' AND created_at <= '2023-06-30')
-- Buffers: shared hit=3421 read=12
-- Problematic output (sequential scan):
-- Seq Scan on orders
-- (cost=0.00..289334.00 rows=45123 width=85)
-- (actual time=0.018..1823.445 rows=44892 loops=1)
-- Filter: (created_at >= '2023-01-01' AND created_at <= '2023-06-30')
-- Rows Removed by Filter: 4955108
-- Buffers: shared hit=1200 read=148334

A sequential scan on 5 million rows to return 45,000 results indicates one of the following: table statistics are stale (resolved by running ANALYZE), the planner estimates the result set is large enough that sequential I/O outperforms random I/O from index lookups, or the query selects too many columns for an index-only scan to be viable.
Partial Indexes
Partial indexes cover a subset of rows, defined by a WHERE clause. They are smaller, faster to maintain, and faster to scan.
-- Full index on all 50 million rows:
CREATE INDEX idx_orders_status ON orders(status);
-- Partial index on only the relevant subset:
CREATE INDEX idx_active_orders ON orders(status)
WHERE status IN ('pending', 'processing');

If 95% of rows have status = 'completed', the partial index is roughly 20x smaller than the full index. The planner uses the partial index when the query's WHERE clause matches or implies the index predicate.
Covering Indexes with INCLUDE
PostgreSQL 11 introduced the INCLUDE clause, which adds non-key columns to an index's leaf nodes without incorporating them into the B-tree sort order.
CREATE INDEX idx_orders_lookup ON orders(customer_id, created_at)
INCLUDE (total_amount, status);

Queries filtering on customer_id and created_at that only project total_amount and status can execute as pure index-only scans. The included columns occupy leaf-node space but do not affect tree structure or sort order.
Composite Index Key Ordering
A composite index (a, b, c) supports efficient lookups on (a), (a, b), and (a, b, c). It does not support efficient lookups on (b), (c), or (b, c) alone. This is the leftmost prefix rule, a direct consequence of B-tree sort order.
Consider a composite index sorted by (last_name, first_name). The tree can efficiently locate all entries with last_name = 'Smith', or last_name = 'Smith' AND first_name = 'Jane'. It cannot efficiently locate all entries with first_name = 'Jane' regardless of last name, because the entries are not sorted by first_name at the top level.
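A sorted in-memory list makes the asymmetry concrete. This is a toy stand-in for B-tree leaf order, not a real index:

```python
from bisect import bisect_left

index = sorted([
    ("Jones", "Jane"), ("Smith", "Adam"),
    ("Smith", "Jane"), ("Smith", "Zoe"),
])

# Leftmost-prefix lookup: all ("Smith", ...) entries form one contiguous
# run, found with two binary searches on the prefix alone.
lo = bisect_left(index, ("Smith",))      # ("Smith",) sorts before ("Smith", x)
hi = bisect_left(index, ("Smith\x00",))  # successor of the prefix
smiths = index[lo:hi]

# Lookup on first_name alone: matching entries are scattered across the
# sort order, so every entry must be examined -- in effect a full scan.
janes = [row for row in index if row[1] == "Jane"]
```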
-- Given: CREATE INDEX idx_composite ON users(department, hire_date, salary);
-- Uses full index:
SELECT * FROM users WHERE department = 'eng' AND hire_date > '2023-01-01' AND salary > 100000;
-- Uses first two columns of index:
SELECT * FROM users WHERE department = 'eng' AND hire_date > '2023-01-01';
-- Uses first column only:
SELECT * FROM users WHERE department = 'eng';
-- Cannot use this index efficiently:
SELECT * FROM users WHERE hire_date > '2023-01-01' AND salary > 100000;
-- Cannot use this index at all:
SELECT * FROM users WHERE salary > 100000;

LSM Trees
Log-Structured Merge Trees optimize for write throughput by buffering all writes in memory and flushing to disk sequentially. LSM trees power RocksDB, LevelDB, Cassandra, HBase, and CockroachDB's storage layer.
LSM Tree Architecture:
Writes
|
v
+---------------------+
| MemTable | In-memory sorted structure (red-black tree
| (mutable, sorted) | or skip list). All writes go here.
| | Typical threshold: 64 MB.
+---------------------+
|
| (MemTable reaches threshold, becomes immutable)
v
+---------------------+
| Immutable MemTable | Background thread flushes to disk
| (pending flush) | as a sorted SSTable file.
+---------------------+
|
v
+================================================================+
| ON DISK |
| |
| Level 0 (L0): [SST-1] [SST-2] [SST-3] [SST-4] |
| - Recently flushed SSTables |
| - Key ranges MAY OVERLAP between files |
| - Target total size: ~256 MB |
| |
| Level 1 (L1): [SST-A] [SST-B] [SST-C] [SST-D] [SST-E] |
| - Non-overlapping key ranges |
| - Each file covers a distinct key range |
| - Target total size: ~2.56 GB (10x L0) |
| |
| Level 2 (L2): [SST-...many files...] |
| - Non-overlapping key ranges |
| - Target total size: ~25.6 GB (10x L1) |
| |
| Level 3 (L3): [SST-...more files...] |
| - Non-overlapping key ranges |
| - Target total size: ~256 GB (10x L2) |
| |
+================================================================+
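The write path above can be sketched in a few lines. This is a toy model: a dict stands in for the MemTable and sorted lists for SSTables, with no WAL, levels, or compaction:

```python
class TinyLSM:
    """Toy LSM write path: mutable MemTable, flushed to sorted runs."""

    def __init__(self, memtable_limit: int = 2):
        self.memtable = {}
        self.sstables = []  # newest first; each is a sorted list of (key, value)
        self.memtable_limit = memtable_limit

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        # Flush the MemTable to "disk" as a sorted run, newest first.
        self.sstables.insert(0, sorted(self.memtable.items()))
        self.memtable = {}

    def get(self, key):
        if key in self.memtable:        # check MemTable first
            return self.memtable[key]
        for sst in self.sstables:       # then each run, newest to oldest
            for k, v in sst:
                if k == key:
                    return v
        return None
```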
Compaction
As SSTables accumulate, read performance degrades because multiple files must be checked for any given key. Compaction merges SSTables, removing duplicate keys and tombstones (deletion markers).
Leveled Compaction: Merging L0 into L1
Before:
L0: [a-f] [c-k] [b-d] <-- overlapping key ranges
L1: [a-d] [e-h] [i-m] <-- non-overlapping
Step 1: Identify L0 files overlapping with target L1 range.
All three L0 files overlap with L1 files [a-d] and [e-h].
Step 2: Multi-way merge sort of selected files.
For duplicate keys, retain the newest version.
Discard tombstoned entries past their retention period.
Step 3: Write new L1 files with non-overlapping ranges:
L1: [a-c] [d-f] [g-h] [i-m]
Step 4: Atomically delete old input files.
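Steps 2 and 3 amount to a merge that keeps the newest version of each key and drops tombstones, sketched here (simplified: real compactions stream sorted files rather than materializing a dict):

```python
TOMBSTONE = None  # deletion marker

def compact(sstables_newest_first):
    """Merge SSTable runs into one sorted, non-overlapping run.
    Input: list of [(key, value), ...] runs, newest first.
    Keeps the newest version of each key; drops tombstoned keys."""
    merged = {}
    for sst in reversed(sstables_newest_first):  # oldest first, newer overwrites
        for key, value in sst:
            merged[key] = value
    return sorted((k, v) for k, v in merged.items() if v is not TOMBSTONE)
```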
RocksDB supports multiple compaction strategies. Leveled compaction (the default) limits space amplification but increases write amplification. Tiered (universal) compaction reduces write amplification at the cost of higher space amplification. FIFO compaction is suitable for time-series data where old entries can be discarded.
Write Amplification
A single key-value pair may be physically written multiple times: once to the WAL, once during the MemTable flush, then once per compaction level (L0 to L1, L1 to L2, and so on). Total write amplification of 10-30x is common. This is a significant consideration for SSD endurance.
Read Amplification and Bloom Filters
Point reads in an LSM tree may need to check the MemTable, then each level on disk. Bloom filters reduce unnecessary SSTable reads.
Bloom Filter Operation:
Query: "Does this SSTable contain key X?"
key X --> hash1(X) = 3, hash2(X) = 7, hash3(X) = 11
Bit array: [0|0|0|1|0|0|0|1|0|0|0|1|0|0|0|0]
                  ^       ^       ^
                bit 3   bit 7   bit 11
All bits set? --> "Possibly present" (read the file)
Any bit unset? --> "Definitely absent" (skip the file)
Configuration: 10 bits per key yields ~1% false positive rate.
This eliminates ~99% of unnecessary SSTable reads.
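A minimal Bloom filter sketch (the double-hashing construction here is illustrative; RocksDB uses its own hashing scheme and block-based filter layout):

```python
import hashlib

class BloomFilter:
    def __init__(self, size_bits: int = 128, num_hashes: int = 3):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = 0  # bit array packed into one integer

    def _positions(self, key: str):
        """Derive num_hashes bit positions from the key."""
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, key: str) -> None:
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key: str) -> bool:
        # All bits set -> "possibly present"; any bit unset -> "definitely absent"
        return all(self.bits >> pos & 1 for pos in self._positions(key))
```

False positives are possible (an absent key may hit only bits set by other keys), but false negatives are not: an added key always reports present, which is why a negative answer lets the read path skip the SSTable entirely.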
B-Tree vs. LSM Tree Trade-offs
| Characteristic | B-Tree | LSM Tree |
|---|---|---|
| Write amplification | Moderate (in-place updates) | High (compaction rewrites) |
| Read amplification | Low (single tree traversal) | Moderate (check multiple levels) |
| Space amplification | ~1.5x (pages fill to ~67%) | ~1.1x (compaction reclaims) |
| Write throughput | Good | Excellent |
| Point read latency | Excellent | Good |
| Range scan performance | Excellent | Good |
| Space reclamation | Immediate (page reuse) | Deferred (compaction) |
| Latency predictability | High | Lower (compaction-induced jitter) |
The choice depends on the workload. A 90% write workload benefits from LSM-based storage. A read-heavy workload with strict latency requirements favors B-tree engines.
Write-Ahead Logging (WAL)
WAL provides durability. The invariant: before any data page modification is applied, a corresponding log record must be persisted to stable storage. On crash, the database replays the log to reconstruct any changes that were committed but not yet flushed to data files.
WAL Write Flow:
Statement: UPDATE accounts SET balance = 500 WHERE id = 42;
+------------+
| SQL Parser |
+-----+------+
|
v
+-----+--------+
| Query Planner |
| (Index Scan |
| on PK) |
+-----+--------+
|
v
+-----+------------------------+
| Executor locates row |
| Page 7, Offset 128 |
| Current value: balance = 1000 |
+-----+------------------------+
|
v
+-----+------------------------+
| Generate WAL record: | +--------------------------+
| Transaction ID: 4821 +----->| WAL Buffer (memory) |
| Relation: accounts | | [LSN 1001] TXN 4821 |
| Page: 7, Offset: 128 | | page=7, off=128 |
| Old value: 1000 | | old=1000, new=500 |
| New value: 500 | +--------------------------+
+-----+------------------------+ |
| |
v |
+-----+------------------------+ |
| Modify page 7 in Buffer Pool | |
| (page marked "dirty") | |
+-----+------------------------+ |
| v
v +--------+------------------+
+-----+------------------------+ | fsync WAL to disk |
| COMMIT | | (this is the durability |
| Return success to application +------>| guarantee. Once the WAL |
+-------------------------------+ | record is on disk, the |
| transaction is durable.) |
+---------------------------+
|
(asynchronously, minutes later) |
v
+--------+------------------+
| CHECKPOINT |
| Flush dirty pages from |
| buffer pool to data files|
| Advance WAL reclaim point|
+----------------------------+
Checkpoints
The WAL cannot grow indefinitely. A checkpoint flushes all dirty pages with LSN values at or below the checkpoint LSN to the data files. After a checkpoint completes, WAL segments prior to that point can be recycled.
Checkpoint Mechanics:
Timeline of WAL records:
[rec1][rec2][rec3][rec4][rec5][rec6][rec7][rec8][rec9]
^ ^
last checkpoint current position
Checkpoint procedure:
1. Record current WAL position (e.g., LSN = 6)
2. Flush all dirty pages with LSN <= 6 to data files
3. fsync data files
4. Write checkpoint record to WAL
5. WAL segments before LSN 6 are now reclaimable
Recovery scenarios:
- Crash after checkpoint: replay only records 7, 8, 9. Fast recovery.
- Crash with stale checkpoint: replay all records since last checkpoint.
This is why PostgreSQL defaults checkpoint_timeout to 5 minutes.
PostgreSQL's checkpoint_completion_target (default 0.9) spreads checkpoint writes over 90% of the checkpoint interval to avoid I/O spikes. Setting it lower concentrates writes into a shorter burst, causing more pronounced I/O latency during checkpoints.
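Crash recovery under this scheme reduces to replaying records past the checkpoint. A sketch under simplifying assumptions (each record carries the full new page contents; real WALs mix physical and logical records):

```python
def recover(pages: dict, wal: list, checkpoint_lsn: int) -> dict:
    """Replay WAL records with LSN greater than the last checkpoint.
    pages: page_no -> contents; wal: list of (lsn, page_no, new_contents)."""
    for lsn, page_no, new_contents in wal:
        if lsn > checkpoint_lsn:           # records at or before the checkpoint
            pages[page_no] = new_contents  # are already in the data files
    return pages
```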
WAL in SQLite
SQLite's WAL implementation is simpler due to its single-file, embedded architecture. The WAL file resides alongside the main database file. Readers access the database file directly while a single writer appends to the WAL. Checkpointing copies modified pages from the WAL back into the main database file. There is no shared buffer pool or inter-process memory management.
The Buffer Pool
The buffer pool is a region of shared memory that caches disk pages. All reads and writes go through the buffer pool. Direct disk access does not occur during normal query execution.
Buffer Pool (e.g., 8 GB shared_buffers in PostgreSQL):
+--------+--------+--------+--------+--------+--------+--------+--------+
| Page | Page | Page | Page | Page | Page | Page | Page |
| 42 | 107 | 1893 | 7 | 553 | 891 | 2201 | FREE |
| clean | dirty | clean | dirty | clean | clean | dirty | |
| pin:0 | pin:1 | pin:0 | pin:0 | pin:3 | pin:0 | pin:0 | |
+--------+--------+--------+--------+--------+--------+--------+--------+
States:
clean = page matches on-disk copy
dirty = page modified in memory, not yet flushed
pin:N = N active references (pinned pages cannot be evicted)
Lookup path:
"Need page 7 of table orders"
--> hash(table_oid, page_number)
--> buffer pool slot 3
--> page is cached, no disk I/O required
Eviction (PostgreSQL uses clock-sweep):
1. Clock hand sweeps buffer pool slots
2. Skip pinned pages
3. Decrement usage_count of unpinned pages
4. Evict first page with usage_count = 0
5. If evicted page is dirty, flush to disk before evicting
6. Frequently accessed pages maintain high usage_count,
keeping them resident
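The sweep itself fits in a few lines (a sketch of the algorithm described above, not PostgreSQL's actual buffer-manager code):

```python
def clock_sweep(slots, hand=0):
    """Find an eviction victim. Each slot is a dict with 'pinned' and
    'usage_count'. Decrements usage counts as the hand passes; returns
    (victim_index, next_hand). Assumes at least one unpinned slot exists."""
    n = len(slots)
    while True:
        slot = slots[hand]
        if not slot["pinned"]:
            if slot["usage_count"] == 0:
                return hand, (hand + 1) % n
            slot["usage_count"] -= 1   # give the page one more chance
        hand = (hand + 1) % n
```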
Sizing the Buffer Pool
PostgreSQL's shared_buffers default of 128 MB is inadequate for most production workloads. The standard recommendation is 25% of total system RAM. The remaining memory serves the OS page cache, which provides a second layer of caching.
A database serving a working set of 2 GB with shared_buffers = 128 MB will evict and reload pages continuously. Increasing shared_buffers to 4 GB eliminates this thrashing for that workload. The improvement in average query latency can be an order of magnitude.
Double-caching (the same page in both shared_buffers and the OS page cache) is an inherent inefficiency of this architecture. Some databases (e.g., MySQL with O_DIRECT) bypass the OS cache to avoid this. PostgreSQL does not use O_DIRECT by default.
Monitoring Buffer Pool Hit Rate
SELECT
sum(heap_blks_read) AS disk_reads,
sum(heap_blks_hit) AS buffer_hits,
round(
sum(heap_blks_hit)::numeric /
NULLIF(sum(heap_blks_hit) + sum(heap_blks_read), 0),
4
) AS hit_ratio
FROM pg_statio_user_tables;
-- A hit_ratio below 0.99 indicates that the buffer pool is undersized
-- for the working set, or that the working set exceeds available RAM.

For InnoDB, the equivalent metric is available via:
SHOW ENGINE INNODB STATUS;
-- Look for "Buffer pool hit rate" in the output.
-- Also available via:
SELECT
(1 - (innodb_buffer_pool_reads / innodb_buffer_pool_read_requests)) AS hit_ratio
FROM (
SELECT
VARIABLE_VALUE AS innodb_buffer_pool_reads
FROM performance_schema.global_status
WHERE VARIABLE_NAME = 'Innodb_buffer_pool_reads'
) r,
(
SELECT
VARIABLE_VALUE AS innodb_buffer_pool_read_requests
FROM performance_schema.global_status
WHERE VARIABLE_NAME = 'Innodb_buffer_pool_read_requests'
) rr;

MVCC (Multi-Version Concurrency Control)
MVCC allows concurrent transactions to see consistent snapshots of the database without read locks blocking writes. Each transaction sees data as it existed at the transaction's start time.
PostgreSQL MVCC Implementation
PostgreSQL stores version information directly in each tuple using xmin (the transaction ID that created the tuple) and xmax (the transaction ID that deleted or replaced the tuple).
MVCC Versioning Example:
Initial state, row account_id=1:
+------+-------+-----------+---------+
| xmin | xmax | account_id| balance |
+------+-------+-----------+---------+
| 100 | - | 1 | 1000 |
+------+-------+-----------+---------+
xmin=100: created by committed transaction 100
xmax=-: no transaction has deleted or updated this version
Transaction A (txid=200) begins:
Snapshot: "visible = committed before txid 200"
SELECT balance FROM accounts WHERE account_id = 1;
Row xmin=100 (committed, < 200), xmax=null --> visible
Result: 1000
Transaction B (txid=201) executes:
UPDATE accounts SET balance = 500 WHERE account_id = 1;
PostgreSQL does NOT update in place.
It marks the old tuple dead and inserts a new version:
+------+-------+-----------+---------+
| xmin | xmax | account_id| balance |
+------+-------+-----------+---------+
| 100 | 201 | 1 | 1000 | <-- old version, dead after txn 201
| 201 | - | 1 | 500 | <-- new current version
+------+-------+-----------+---------+
Transaction B commits.
Transaction A reads again (still using snapshot txid=200):
Row xmin=100, xmax=201: xmax committed but after snapshot --> still visible
Row xmin=201: created after snapshot --> not visible
Result: 1000 (consistent with original snapshot)
Transaction C (txid=202) begins after B committed:
Row xmin=100, xmax=201: xmax committed and before snapshot --> dead, skip
Row xmin=201, xmax=null: committed and before snapshot --> visible
Result: 500
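The visibility rules in the walkthrough condense into one function (simplified: this ignores in-progress transactions, subtransactions, and hint bits):

```python
def tuple_visible(tup, snapshot_xid, committed):
    """Simplified PostgreSQL visibility check.
    tup: dict with 'xmin' and 'xmax' (xmax is None if never deleted/updated).
    A transaction is "in the snapshot" if it committed with xid < snapshot_xid."""
    xmin, xmax = tup["xmin"], tup["xmax"]
    if xmin not in committed or xmin >= snapshot_xid:
        return False  # creating transaction is not visible to this snapshot
    if xmax is not None and xmax in committed and xmax < snapshot_xid:
        return False  # the deletion is visible, so this version is dead
    return True
```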
InnoDB MVCC Implementation
InnoDB takes a different approach. The main clustered index always contains the latest committed version of each row. Previous versions are stored in the undo log (rollback segment). When a transaction needs an older version, InnoDB follows a chain of undo log entries to reconstruct it.
This design eliminates the need for a VACUUM-like process. The undo log is garbage-collected automatically once no active transaction requires the old versions. However, long-running transactions prevent undo log cleanup, which can cause the undo tablespace to grow substantially.
VACUUM in PostgreSQL
Dead tuples (old row versions no longer visible to any active transaction) accumulate in PostgreSQL tables. Without removal, they cause table bloat: the table occupies far more disk space than the live data requires, and sequential scans must read through dead tuples.
Table Bloat Without VACUUM:
Page 1: [live][dead][dead][live][dead][dead][dead][live]
Page 2: [dead][dead][dead][dead][dead][dead][live][dead]
Page 3: [live][dead][live][dead][dead][dead][dead][dead]
...
Page 5000: [dead][dead][dead][dead][dead][dead][dead][live]
~15% of page space contains live data.
Sequential scans read ALL pages, including those dominated by dead tuples.
After VACUUM:
Page 1: [live][ free space ][live][ free space ][live]
Page 2: [live][ free space ]
Page 3: [live][live][ free space ]
Dead tuples removed. Space marked as reusable for future inserts.
File size does NOT shrink. Pages are not returned to the OS.
After VACUUM FULL (requires exclusive lock, rewrites entire table):
Page 1: [live][live][live][live][live]
Page 2: [live][live][live][live][live]
Table compacted. File size reduced. Indexes rebuilt.
Requires exclusive access for the duration of the operation.
Autovacuum Tuning
PostgreSQL's autovacuum daemon handles routine vacuuming. The default configuration triggers vacuum when dead tuples exceed 20% of the table size (autovacuum_vacuum_scale_factor = 0.2). For tables with heavy update traffic, this threshold is too conservative.
-- Per-table autovacuum configuration for a high-churn table:
ALTER TABLE orders SET (
autovacuum_vacuum_scale_factor = 0.01,
autovacuum_vacuum_threshold = 1000,
autovacuum_analyze_scale_factor = 0.005
);

Reducing the scale factor to 0.01 triggers vacuum when 1% of the table consists of dead tuples, keeping bloat under control. The autovacuum_analyze_threshold and autovacuum_analyze_scale_factor settings control when statistics are refreshed, which affects query planner accuracy.
VACUUM and Visibility Maps
VACUUM also updates the visibility map, a bitmap with one bit per heap page. A set bit indicates that all tuples on that page are visible to all current and future transactions. The visibility map enables index-only scans: if the visibility map confirms a page is all-visible, PostgreSQL does not need to fetch the heap page to check tuple visibility.
Running VACUUM regularly therefore improves index-only scan performance in addition to reclaiming dead space.
Clustered vs. Heap Storage
PostgreSQL: Heap Storage
PostgreSQL stores table data in an unordered heap. Indexes contain pointers (tuple IDs) to heap locations. An index lookup traverses the B-tree, retrieves the tuple ID, then fetches the row from the heap.
InnoDB: Clustered Index Storage
InnoDB stores table data in primary key order within the clustered index B-tree. Leaf nodes of the primary key index contain the full row data.
PostgreSQL (heap + separate indexes):

     Primary Key Index             Secondary Index
          [30|70]                       [A|M]
         /   |   \                     /     \
   [10|20]  ...  [80|90]           [A-L]    [M-Z]
                    |                          |
                    v                          v
          TID(page 5, slot 2)        TID(page 5, slot 2)
                    |                          |
                    +------------+-------------+
                                 |
                                 v
                    Heap (unordered):
                    Page 5: [..., {id:10, name:"Alice", ...}, ...]
InnoDB (clustered primary key):

     Primary Key Index (IS the table):
          [30|70]
         /   |   \
   [10|20]  ...  [80|90]
       |
       v
   Leaf contains FULL ROW:
   {id:10, name:"Alice", email:"alice@...", balance:1000}

     Secondary Index:
         [A|M]
        /     \
     [A-L]   [M-Z]
        |
        v
   Leaf contains: {name:"Alice", PRIMARY_KEY: 10}
        |
   (requires second B-tree traversal
    into clustered index to retrieve
    full row: "bookmark lookup")
Consequences of clustered index storage:
- Primary key range scans are efficient because rows are physically contiguous.
- Secondary index lookups require a bookmark lookup into the clustered index, adding one extra B-tree traversal compared to PostgreSQL's single heap fetch.
- Secondary indexes store the primary key value (not a physical address), so wide primary keys increase the size of every secondary index.
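The bookmark lookup is easy to model with two dicts. This is a toy: real InnoDB traversals are B-tree descents, and the column names here are invented for illustration:

```python
# Clustered index: primary key -> full row (the leaf holds the row itself).
clustered = {
    10: {"name": "Alice", "email": "alice@example.com", "balance": 1000},
    30: {"name": "Bob",   "email": "bob@example.com",   "balance": 250},
}

# Secondary index: indexed value -> primary key (never a physical address).
by_name = {"Alice": 10, "Bob": 30}

def find_by_name(name: str) -> dict:
    pk = by_name[name]    # traversal 1: secondary index leaf yields the PK
    return clustered[pk]  # traversal 2: bookmark lookup into the clustered index
```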
Query Execution Path
A complete trace of a point query in PostgreSQL:
SELECT name, email FROM users WHERE id = 42;
1. PARSE
SQL text --> parse tree
Validate table and column names against system catalog
2. PLAN
Check pg_statistic for table 'users'
Available indexes: pk_users (B-tree on id)
Cost estimates:
Index scan: 4.02 (traverse ~3 B-tree levels + 1 heap fetch)
Seq scan: 15234.00 (read all ~120,000 pages)
Selected plan: Index Scan on pk_users
3. EXECUTE
a. Locate root page of pk_users index
Check buffer pool: hash(index_oid, root_page_number)
If cached: read from shared_buffers (0 disk I/O)
If not cached: read from OS cache or disk, install in buffer pool
b. Traverse B-tree
Root --> internal node --> internal node --> leaf node
Each level: buffer pool lookup, possible disk read
Root and upper levels are almost always cached for active indexes
c. Leaf entry found: id=42 --> TID (page 88, offset 5)
d. Fetch heap page 88 from buffer pool (or disk)
e. Read tuple at page 88, offset 5
Visibility check:
xmin committed and before snapshot? Yes
xmax not set? Yes
Tuple is visible to this transaction
f. Project columns: extract 'name' and 'email' from tuple
4. RETURN
Send result row to client via wire protocol
Typical latency: 0.04 ms (fully cached), 2-10 ms (with disk reads)
Performance Considerations
Write Batching and WAL Syncs
Each transaction commit triggers an fsync of the WAL. Committing 1,000 individual single-row transactions requires 1,000 fsyncs. On an SSD with fsync latency of 1-2 ms, this takes 1-2 seconds. Batching those 1,000 rows into a single transaction requires one fsync, completing in approximately 2 ms.
-- Slow: 1000 individual transactions
BEGIN; INSERT INTO events VALUES (...); COMMIT;
BEGIN; INSERT INTO events VALUES (...); COMMIT;
-- ... repeated 1000 times
-- Total: ~1000 fsyncs, ~1-2 seconds
-- Fast: single transaction
BEGIN;
INSERT INTO events VALUES (...), (...), (...), ... ; -- 1000 rows
COMMIT;
-- Total: 1 fsync, ~2 ms

PostgreSQL's synchronous_commit = off setting allows the server to report success before the WAL fsync completes, reducing per-transaction latency at the risk of losing the most recent transactions on crash (a window of up to roughly three times wal_writer_delay, which defaults to 200 ms). This is acceptable for workloads where occasional data loss is tolerable.
Row Width and Page Utilization
Narrower rows fit more tuples per page, reducing the number of pages that sequential scans must read.
| Row width | Rows per 8 KB page | Pages for 1M rows |
|---|---|---|
| 100 bytes | ~80 | 12,500 |
| 500 bytes | ~16 | 62,500 |
| 2000 bytes | ~4 | 250,000 |
| 4000 bytes | ~2 | 500,000 |
Removing unused columns, using appropriately sized data types (smallint vs. bigint, varchar(50) vs. text for bounded fields), and moving large values to separate tables reduces row width. PostgreSQL's TOAST mechanism automatically stores values exceeding ~2 KB out-of-line, but proactive schema design avoids reliance on TOAST for frequently accessed columns.
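The table's arithmetic follows from a rough model that ignores page headers, item pointers, and per-tuple overhead (so real page counts run somewhat higher):

```python
import math

def pages_needed(row_count: int, row_width: int, page_size: int = 8192) -> int:
    """Pages required to hold row_count rows, ignoring per-page overhead."""
    rows_per_page = page_size // row_width
    return math.ceil(row_count / rows_per_page)

# 500-byte rows: 16 rows per 8 KB page, 62,500 pages for a million rows
```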
Transaction Duration and Bloat
In PostgreSQL, the oldest active transaction's snapshot determines the xmin horizon. Dead tuples whose deleting transaction committed after this horizon cannot be vacuumed, even if no active transaction actually reads them. A single long-running transaction (including idle transactions from forgotten BEGIN statements in interactive sessions) prevents VACUUM from reclaiming dead tuples across the entire database.
-- Identify long-running transactions:
SELECT pid, now() - xact_start AS duration, state, query
FROM pg_stat_activity
WHERE xact_start IS NOT NULL
ORDER BY xact_start;

Storage Engine Selection
| Workload Pattern | Recommended Engine |
|---|---|
| Read-heavy, point lookups | B-tree (PostgreSQL, MySQL/InnoDB) |
| Write-heavy, append-only | LSM (RocksDB, Cassandra) |
| Mixed read/write, ACID required | B-tree (PostgreSQL, MySQL/InnoDB) |
| Time-series, high ingest rate | LSM (RocksDB, InfluxDB) |
| Embedded, single-process | SQLite |
| Key-value, extreme write volume | LSM (RocksDB) |
SQLite Internals
SQLite stores an entire database in a single file. It uses B-trees for both tables and indexes, WAL for durability (when WAL mode is enabled), and ACID transactions with serializable isolation.
Key characteristics:
- Single-writer model. WAL mode permits concurrent readers with one writer. Without WAL mode, readers and writers block each other.
- No client-server protocol. SQLite is a library linked into the application process. No network overhead, no connection management, no authentication layer.
- The codebase is approximately 150,000 lines of C. The test suite exceeds 90 million lines, representing one of the highest test-to-code ratios in production software.
- Default page size is 4 KB. Page sizes of 8 KB, 16 KB, 32 KB, and 64 KB are supported.
- The file format is stable and backward-compatible. Database files created in 2004 are still readable by current versions.
SQLite is appropriate for applications with moderate write volume, embedded systems, mobile applications, and any context where the operational overhead of a client-server database is not justified.