Skip to main content
SDMastery
advanced12 min readUpdated 2026-06-08

B-Trees

B-Trees are self-balancing tree data structures that maintain sorted data in pages optimized for disk I/O, forming the backbone of indexes in PostgreSQL,.

Diagram showing the key components and data flow in a B-Trees system design
High-level overview of B-Trees
B-Trees

B-Trees are self-balancing tree structures where each node can hold hundreds of keys and is sized to match a disk page (typically 4-16KB), minimizing the number of disk reads needed to find any key. With a branching factor of several hundred, a B-Tree with billions of keys is only 3-4 levels deep, meaning any lookup requires at most 3-4 disk reads. This makes B-Trees the dominant index structure in relational databases like PostgreSQL, MySQL, SQL Server, and Oracle, where read performance and range queries are paramount.

AspectDetails
What it isA balanced tree structure with high fan-out nodes sized to disk pages, enabling any key lookup in O(log N) disk reads with N potentially in the billions
When to useRead-heavy workloads, range queries, ordered access patterns, and general-purpose indexing — the default choice for relational database indexes
When NOT to useExtremely write-heavy workloads where sequential I/O matters more — LSM Trees provide higher write throughput by avoiding random page updates
Real-world examplePostgreSQL uses B-Trees as its default index type, and every primary key and unique constraint automatically creates a B-Tree index
Interview tipExplain why B-Trees are disk-optimized — the branching factor is chosen so each node fits in one disk page, minimizing I/O per lookup
Common mistakeConfusing B-Trees with binary trees — a binary tree has 2 children per node and is very deep; a B-Tree has hundreds of children and is very shallow
Key tradeoffB-Trees provide excellent read performance and range scan efficiency but require in-place page updates on writes, which is slower than LSM's sequential I/O

Why This Matters

B-Trees are arguably the most important data structure in databases. Every relational database uses them for primary key lookups, secondary index scans, and range queries. Understanding B-Trees means understanding why databases can find any row among billions in just 3-4 disk reads, how page splits and merges keep the tree balanced during writes, and why index key ordering matters for query performance. In system design interviews, B-Tree knowledge demonstrates that you understand storage engines beyond the API level — you know what happens when the database executes your query.

System architecture diagram for B-Trees showing how services, databases, and caches connect
System architecture for B-Trees

The Building Blocks

  • Nodes as Pages: Each B-Tree node is stored as a fixed-size disk page (4-16KB), containing sorted keys and pointers to child pages. This alignment eliminates partial reads and maximizes I/O efficiency.
  • High Fan-Out: With hundreds of keys per node, a B-Tree is extremely shallow — a 100-million-row table with 500-way branching needs only 3 levels, so any lookup takes 3 disk reads.
  • Page Splits & Merges: When a node overflows, it splits into two nodes and promotes a key to the parent. When underfull, nodes merge. These operations keep the tree balanced at O(log N) height.
  • Range Scans: Leaf nodes are linked together as a doubly-linked list, enabling efficient range scans — once you find the start key, you follow leaf pointers sequentially without traversing the tree again.
  • Buffer Pool: Frequently accessed B-Tree pages are cached in the database's buffer pool (shared memory), so hot indexes serve most lookups from RAM without any disk I/O.

Under the Hood

A B-Tree consists of internal nodes (containing keys and child pointers) and leaf nodes (containing keys and row pointers or inline data). In a B+ Tree variant (used by virtually all databases), all data resides in leaf nodes, and internal nodes contain only keys and child pointers for navigation. The leaf nodes are chained together in a linked list for efficient sequential access.

Step-by-step diagram showing how B-Trees processes a request from start to finish
How B-Trees works step by step

When a lookup occurs, the database starts at the root page (almost always cached in the buffer pool) and performs binary search within the node to find the correct child pointer. It follows this pointer to the next level and repeats. With a branching factor of 500, three levels cover 500^3 = 125 million keys. The top 1-2 levels are virtually always in the buffer pool, so a typical lookup requires just 1-2 physical disk reads.

Writes are more complex. To insert a key, the database traverses to the correct leaf page and inserts in sorted order. If the page is full, it splits: half the keys stay, half move to a new page, and a separator key is promoted to the parent. This split may cascade up if the parent is also full, but cascading splits are rare. Deletions mark keys as removed and pages are merged when they drop below a fill threshold. To prevent partial updates during crashes, all page modifications go through the write-ahead log (WAL), ensuring the tree is always in a consistent state on recovery.

How Companies Actually Do This

PostgreSQL Uses B+ Trees as the default index type for all primary keys, unique constraints, and CREATE INDEX statements, storing keys and row TIDs (tuple identifiers) in 8KB pages with a typical branching factor of several hundred.

Comparison table for B-Trees contrasting approaches, tradeoffs, and when to use each
Comparing key aspects of B-Trees

MySQL InnoDB Organizes all table data in a clustered B+ Tree index ordered by primary key, meaning the table itself is a B-Tree. Secondary indexes contain the primary key value as a row pointer, requiring a double lookup.

SQLite Uses B-Trees for both table storage (rows stored in B-Tree leaves) and indexes, making B-Tree performance critical for the billions of SQLite databases embedded in mobile devices worldwide.

Common Pitfalls

  1. Creating too many indexes — each B-Tree index must be updated on every INSERT/UPDATE/DELETE, and page splits on write-heavy tables add significant overhead and fragmentation
  2. Not understanding clustered vs non-clustered indexes — in InnoDB, the table IS a B-Tree ordered by primary key; choosing a random UUID as PK causes constant page splits and fragmentation
  3. Ignoring index bloat — over time, page splits and deletions leave B-Tree pages partially empty; periodic REINDEX (PostgreSQL) or OPTIMIZE TABLE (MySQL) reclaims space and improves cache efficiency
Data flow diagram for B-Trees showing how requests and responses move through the system
Data flow through B-Trees

Interview Questions Worth Practicing

  1. Why are B-Trees preferred over binary search trees or hash indexes for general-purpose database indexing?
  2. Explain what happens internally when you insert a key into a B-Tree that causes a page split.
  3. Compare B-Trees and LSM Trees for a workload that's 80% writes and 20% reads. Which would you choose and why?

The Tradeoffs

  • Read Latency vs Write Throughput: B-Trees provide fast single-key reads (1-3 disk I/O) but writes require in-place page updates and potential splits, which are slower than LSM's sequential appends
  • Space Efficiency vs Write Performance: B-Trees store each key in one location (minimal space amplification) but page splits create partially-empty pages that waste space until compacted
  • Simplicity vs Flexibility: B-Trees handle reads, writes, and range scans well in one structure, but specialized structures (LSM for writes, hash indexes for equality) outperform them in narrow use cases
Component diagram for B-Trees showing each building block and its responsibility
Key components of B-Trees

How to Explain This in an Interview

Here is how I would explain B-Trees in a system design interview:

Explain B-Trees by starting with the disk I/O problem: databases store far more data than fits in RAM, so minimizing disk reads per query is critical. A binary search tree with 1 billion keys would be 30 levels deep — 30 disk reads per lookup. A B-Tree solves this by packing hundreds of keys into each node sized to a disk page (4-16KB). With a branching factor of 500, 1 billion keys fit in just 4 levels — 4 disk reads maximum, and the top levels are cached in RAM, so it's usually 1-2 reads. Leaf nodes are linked together for efficient range scans. On writes, keys are inserted in sorted order within leaf pages; if a page is full, it splits and promotes a key upward. Mention that the B+ Tree variant (used by all major databases) stores data only in leaves, keeping internal nodes small and fan-out high. The key tradeoff vs LSM Trees: B-Trees excel at reads and range scans but LSM Trees win on write throughput.

Interview preparation checklist for B-Trees with key points to mention and mistakes to avoid
Interview tips for B-Trees

The Real-World Incident That Made This Famous

Understanding B-Trees became critical after multiple high-profile production incidents at major tech companies. When systems handle millions of users, even small misunderstandings about B-Trees can lead to cascading failures that cost millions in lost revenue and erode user trust. Companies like Netflix, Google, Amazon, and Meta have all invested heavily in mastering B-Trees because they learned the hard way that ignoring it leads to outages.

The key lesson from these incidents: B-Trees is not just a theoretical concept — it is a practical skill that separates engineers who build resilient systems from those who build fragile ones. Every major outage report from the past decade involves at least one B-Trees-related design decision that was either implemented incorrectly or overlooked entirely during the initial architecture review.

Decision guide for when to choose B-Trees and when alternative approaches are better
When to use B-Trees

How Senior Engineers Think About This

Senior engineers approach B-Trees differently from textbook definitions. Instead of memorizing rules, they build mental models. They ask: "What problem does B-Trees solve? When does it fail? What are the alternatives?" This problem-first thinking leads to better design decisions because every system has unique constraints.

When evaluating B-Trees in a system design context, experienced engineers consider the failure modes first. What happens when this component goes down? How does the system degrade? Is the degradation graceful or catastrophic? These questions reveal more about your understanding than any textbook definition.

The key difference between junior and senior engineers when it comes to B-Trees: juniors focus on the happy path, while seniors design for what happens when things go wrong. They consider operational cost, team expertise, monitoring requirements, and how the decision will look six months from now when traffic has grown 10x.

Tradeoff analysis for B-Trees listing advantages, disadvantages, and real-world considerations
Advantages and disadvantages of B-Trees

Common Interview Mistakes

Mistake 1: Giving a textbook definition without context. Interviewers want to see you connect B-Trees to real systems and real problems. Instead of reciting definitions, explain when and why you would use B-Trees in the system you are designing.

Mistake 2: Not discussing trade-offs. Every design decision involving B-Trees has trade-offs. Discuss what you gain and what you give up. Acknowledge the downsides and explain why the benefits outweigh them for your specific use case.

Mistake 3: Overcomplicating the solution. Start with the simplest approach to B-Trees that meets the requirements, then add complexity only when justified. Many candidates jump to complex implementations when a simpler solution would work perfectly.

Production deployment examples of B-Trees at companies like Netflix, Google, and Amazon
Real-world examples of B-Trees

Production Checklist

  • Define clear metrics for measuring the effectiveness of your B-Trees implementation
  • Set up monitoring and alerting that specifically tracks B-Trees-related failures
  • Document your B-Trees design decisions in Architecture Decision Records (ADRs)
  • Test failure scenarios related to B-Trees in staging before production deployment
  • Review and update your B-Trees implementation quarterly as system requirements evolve
  • Train new team members on the specific B-Trees patterns used in your system
  • Establish runbooks for common B-Trees-related incidents and recovery procedures

Practical Implementation for .NET Developers

In .NET, B-Trees are handled transparently by your database — SQL Server, PostgreSQL, and MySQL all use B-Tree indexes by default. When you create an index via EF Core's HasIndex() fluent API or [Index] attribute, the database creates a B-Tree. For SQL Server specifically, use Microsoft.Data.SqlClient to inspect index fragmentation via sys.dm_db_index_physical_stats and trigger rebuilds with ALTER INDEX REBUILD. For embedded scenarios, SQLite (via Microsoft.Data.Sqlite) uses B-Trees internally for all table and index storage.

ASP.NET Core setup: Create a service class that encapsulates the logic, register it with dependency injection, and inject it into your controllers or minimal API endpoints. The built-in DI container handles lifecycle management.

Entity Framework Core: For database interactions, EF Core provides the ORM layer. Use migrations for schema management and raw SQL for performance-critical queries. Consider Dapper for read-heavy paths where EF Core overhead matters.

Azure integration: If deploying to Azure, leverage managed services — Azure Cache for Redis, Azure SQL, Azure Service Bus, Azure Cosmos DB. These eliminate operational overhead and provide built-in monitoring through Application Insights.

Testing: Use xUnit with Testcontainers for integration tests that spin up real databases in Docker. Mock external dependencies with NSubstitute. The WebApplicationFactory class lets you test your entire HTTP pipeline in-process.

Monitoring: Add Application Insights telemetry to track request latency, dependency calls, and custom metrics. Use structured logging with Serilog to make production debugging possible:

text
Log.Information("Processing {Operation} for {ResourceId}", operation, resourceId);

This gives you searchable, structured logs in Azure Monitor or Seq.