Distributed Systems Reading List
A curated reading list of the most influential distributed systems papers, organized by topic — from foundational theory to modern production systems.
Distributed Systems Reading List
This reading list covers the papers, blog posts, and talks that shaped modern distributed systems. It is organized by topic, from foundational theory to production systems. Each entry includes the key insight and why it matters for practicing engineers.
Reading academic papers is a skill. Do not try to understand every proof on first pass. Read the abstract, introduction, and conclusion first. Then read the system design sections. Return to the proofs only if you need to understand the guarantees deeply.
Tier 1: Essential Papers (Read These First)
These papers introduced ideas that every distributed system uses today. If you read nothing else, read these.
"Dynamo: Amazon's Highly Available Key-Value Store" (2007)
Authors: DeCandia et al., Amazon
Key insight: You can build a highly available key-value store by combining consistent hashing, sloppy quorums, vector clocks, and anti-entropy. The system explicitly chooses availability over consistency during partitions.
Why it matters: Dynamo's design influenced Cassandra, Riak, Voldemort, and DynamoDB. The consistent-hashing-with-virtual-nodes pattern is ubiquitous. The paper teaches you how to think about AP systems.
Read time: 2-3 hours
"The Google File System" (2003)
Authors: Ghemawat, Gobioff, Leung (Google)
Key insight: A distributed file system optimized for large sequential reads and appends (not random writes) can be built with a single master for metadata and many chunkservers for data. Relaxing the consistency model (allowing duplicate records, relying on application-level checksums) simplifies the design.
Why it matters: GFS established the pattern for HDFS and influenced every subsequent distributed storage system. The paper teaches you that the workload determines the architecture — GFS would be terrible for a database, but it is perfect for batch processing.
Read time: 2-3 hours
"MapReduce: Simplified Data Processing on Large Clusters" (2004)
Authors: Dean, Ghemawat (Google)
Key insight: Complex distributed computations can be expressed as two simple functions: Map (transform input records) and Reduce (aggregate results). The framework handles distribution, fault tolerance, and scheduling transparently.
Why it matters: MapReduce launched the big data era. While it has been superseded by Spark and Flink for most workloads, the programming model (transform then aggregate) appears in every data processing system. Understanding MapReduce is prerequisite to understanding its successors.
Read time: 1-2 hours
"Bigtable: A Distributed Storage System for Structured Data" (2006)
Authors: Chang et al. (Google)
Key insight: A wide-column store built on GFS and Chubby (a lock service) that handles petabytes of data across thousands of servers. The data model (row key, column family, timestamp) is optimized for sparse, semi-structured data with high write throughput.
Why it matters: Bigtable's design directly influenced HBase and Cassandra (Cassandra combines Bigtable's data model with Dynamo's distribution). The paper teaches you how Google thinks about storage at planetary scale.
Read time: 2-3 hours
Tier 2: Consensus and Coordination
These papers address the hardest problem in distributed systems: getting multiple machines to agree on something.
"Paxos Made Simple" (2001)
Author: Leslie Lamport
Key insight: The Paxos algorithm achieves consensus (agreement on a single value) among a group of unreliable processes. A majority quorum ensures that at most one value is chosen, even if some processes crash.
Why it matters: Paxos is the theoretical foundation of consensus in distributed systems. Google Spanner, Chubby, and Megastore use Paxos. Understanding Paxos is essential for understanding why distributed databases make the tradeoffs they do.
Reading tip: Read Lamport's "Paxos Made Simple" (2001), not the original "Part-Time Parliament" (1998). The original is written as an archaeological story and is unnecessarily difficult.
Read time: 1-2 hours
"In Search of an Understandable Consensus Algorithm" (Raft, 2014)
Authors: Ongaro, Ousterhout (Stanford)
Key insight: Raft provides the same guarantees as Paxos but is explicitly designed to be understandable. It decomposes consensus into leader election, log replication, and safety — three sub-problems that can be reasoned about independently.
Why it matters: Raft is used in etcd (which powers Kubernetes), CockroachDB, TiKV, and Consul. If you build or operate any of these systems, you need to understand Raft. The paper is also one of the best-written systems papers — it can teach you how to communicate technical ideas clearly.
Read time: 2-3 hours (worth reading thoroughly)
"ZooKeeper: Wait-free Coordination for Internet-Scale Systems" (2010)
Authors: Hunt et al. (Yahoo)
Key insight: A coordination service that provides a hierarchical namespace (like a file system) with strong ordering guarantees. Clients can create ephemeral nodes, watch for changes, and use sequential nodes for distributed locking and leader election.
Why it matters: ZooKeeper was the coordination backbone of Hadoop, Kafka (pre-KRaft), HBase, and Solr. Even as systems move away from ZooKeeper, the coordination patterns (leader election via ephemeral sequential nodes, distributed configuration, group membership) remain fundamental.
Read time: 2 hours
Tier 3: Transactions and Consistency
"Spanner: Google's Globally-Distributed Database" (2012)
Authors: Corbett et al. (Google)
Key insight: By using GPS receivers and atomic clocks to bound clock uncertainty (TrueTime), you can implement globally consistent distributed transactions with external consistency — the strongest consistency guarantee possible.
Why it matters: Spanner proves that you do not have to choose between global distribution and strong consistency, if you invest in clock infrastructure. Cloud Spanner makes this available as a service.
Read time: 3-4 hours (dense paper)
"A Critique of the CAP Theorem" (2015)
Author: Martin Kleppmann
Key insight: The CAP theorem, as commonly stated, is misleading. It conflates network partitions (which you cannot prevent) with a binary choice between "consistency" and "availability" (which oversimplifies reality). Real systems operate on a spectrum of consistency models.
Why it matters: This paper corrects the most common misunderstanding in distributed systems. After reading it, you will use the CAP theorem correctly in interviews and design discussions.
Read time: 1 hour
Tier 4: Modern Production Systems
"Kafka: A Distributed Messaging System for Log Processing" (2011)
Authors: Kreps, Narkhede, Rao (LinkedIn)
Key insight: By treating a message broker as an append-only commit log rather than a queue, you get higher throughput (sequential I/O), message replay (consumers track their own offset), and decoupled consumers (multiple consumer groups read independently).
Why it matters: Kafka redefined how companies build data pipelines. It is the backbone of event-driven architectures at LinkedIn, Uber, Netflix, and most large-scale systems.
Read time: 1-2 hours
"TAO: Facebook's Distributed Data Store for the Social Graph" (2013)
Authors: Bronson et al. (Facebook)
Key insight: A read-optimized cache for the social graph, built on top of MySQL and Memcached. TAO models data as objects (nodes) and associations (edges), with a cache that handles billions of reads per second while maintaining consistency through a write-through cache protocol.
Why it matters: TAO shows how to build a specialized caching layer for a specific access pattern (graph traversals). The paper teaches you when a general-purpose database is insufficient and how to layer caching for extreme read throughput.
Read time: 2 hours
"Scaling Memcache at Facebook" (2013)
Authors: Nishtala et al. (Facebook)
Key insight: At Facebook's scale, Memcached is not just a caching layer — it is a critical distributed system. The paper covers thundering herd mitigation (lease mechanism), cross-region consistency, cache invalidation via McSqueal (a MySQL-to-Memcached invalidation daemon), and cold cache warming.
Why it matters: This paper is the definitive guide to operating a caching layer at extreme scale. Every caching decision you make in system design interviews can be grounded in Facebook's lessons.
Read time: 2 hours
Tier 5: Supplementary Reading
Blog Posts and Talks
- "The Log: What every software engineer should know" (Jay Kreps, LinkedIn) — The theoretical foundation for Kafka and event sourcing.
- "Turning the Database Inside Out" (Martin Kleppmann, Strange Loop 2014) — How to build systems from event logs and derived views.
- "Life Beyond Distributed Transactions" (Pat Helland, 2007) — Why distributed transactions do not scale and what to do instead.
- "How Discord Stores Trillions of Messages" (Discord Engineering Blog) — A practical migration story from MongoDB to Cassandra to ScyllaDB.
Books
- Designing Data-Intensive Applications (Martin Kleppmann) — The best single resource that synthesizes these papers into a coherent narrative.
- Database Internals (Alex Petrov) — Deep dive into storage engine internals, B-trees, LSM trees, and distributed database algorithms.
- Understanding Distributed Systems (Roberto Vitillo) — A more accessible introduction for engineers who find DDIA too dense.
Reading Schedule
For engineers with limited time, a practical schedule:
| Weeks | Papers | Time Investment |
|---|---|---|
| 1-2 | Dynamo, GFS | 3 hours/week |
| 3-4 | MapReduce, Bigtable | 3 hours/week |
| 5-6 | Raft, ZooKeeper | 3 hours/week |
| 7-8 | Spanner, Kafka paper | 3 hours/week |
| 9-10 | TAO, Scaling Memcache | 3 hours/week |
| 11-12 | CAP Critique, supplementary blog posts | 2 hours/week |
After completing this list, you will have a deep understanding of the principles that power modern distributed systems — not from summaries or tutorials, but from the engineers who built these systems.