Resources
30 curated resources — papers, case studies, and implementations.
Distributed Systems Papers
Paxos: The Part-Time Parliament
Lamport's foundational consensus algorithm, which lets distributed systems agree on a single value despite failures.
MapReduce: Simplified Data Processing on Large Clusters
Google's programming model for processing massive datasets in parallel across thousands of commodity machines — the paper that launched the big data era.
The Google File System
Google's distributed file system designed for large-scale data-intensive applications — the blueprint for HDFS and modern distributed storage.
Dynamo: Amazon's Highly Available Key-Value Store
Amazon's AP-leaning distributed key-value store that popularized the combination of consistent hashing, vector clocks, and sloppy quorums, and served as a blueprint for Cassandra.
Kafka: A Distributed Messaging System for Log Processing
LinkedIn's distributed commit log that redefined event streaming — the foundation of modern real-time data pipelines.
Spanner: Google's Globally Distributed Database
Google's globally distributed SQL database that uses GPS and atomic clocks (TrueTime) to achieve external consistency across continents.
Bigtable: A Distributed Storage System for Structured Data
Google's wide-column store that introduced the tablet-based architecture and SSTable storage format — the design behind HBase and Cassandra's data model.
ZooKeeper: Wait-Free Coordination for Internet-Scale Systems
Yahoo's coordination service that provides a simple file-system-like API for distributed synchronization — the backbone of Hadoop, Kafka, and HBase.
The Log-Structured Merge-Tree (LSM-Tree)
O'Neil's write-optimized storage structure that converts random writes into sequential I/O — the engine inside Cassandra, RocksDB, and LevelDB.
The Chubby Lock Service for Loosely-Coupled Distributed Systems
Google's Paxos-based distributed lock service that provides coarse-grained locking and reliable storage for small metadata — the inspiration for ZooKeeper.
Engineering Case Studies
Discord Message Storage: From MongoDB to ScyllaDB
Discord's journey from MongoDB to Cassandra to ScyllaDB — how they scaled message storage for trillions of messages across millions of channels.
Building In-Video Search at Netflix
How Netflix built a system to search within video content using computer vision, ML models, and temporal indexing for precise frame-level retrieval.
How Canva Scaled Media Uploads from Zero to 50 Million Per Day
Canva's architecture evolution to handle 50 million daily media uploads — S3 storage, async processing pipelines, and thumbnail generation at scale.
How Airbnb Avoids Double Payments in a Distributed System
Airbnb's idempotency framework for distributed payments — preventing double-charges across microservices with idempotency keys and state machines.
Designing Stripe's Payments API for 10 Years of Evolution
How Stripe designed an API that evolved over a decade while maintaining backward compatibility, covering versioning strategies and error conventions.
Real-Time Messaging Architecture at Slack
How Slack delivers real-time messages to millions of concurrent users using WebSocket connections, message fanout, and channel-based routing.
Uber's Schemaless: A Trip-Optimized Datastore
How Uber built Schemaless, a fault-tolerant, append-only datastore on top of MySQL that powers trip storage and handles millions of writes per second.
Twitter's Timeline Architecture
How Twitter delivers hundreds of millions of personalized home timelines per day using a hybrid fanout architecture that pre-computes timelines for most users.
How Instagram Scaled to 14 Million Users With 3 Engineers
The architectural decisions that allowed Instagram to scale from zero to 14 million users with a team of 3 engineers, built on PostgreSQL, Redis, and Memcached.
Google Spanner and TrueTime
How Google Spanner uses TrueTime, timestamps synchronized by GPS and atomic clocks, to provide globally consistent distributed transactions.
Why LinkedIn Built Apache Kafka
The origin story of Apache Kafka: how LinkedIn's need to process billions of activity events per day led to the creation of a distributed commit log.
Code Implementations
Consistent Hashing Implementation
Working Java and Python implementations of consistent hashing with virtual nodes, hash ring, and key distribution.
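As a rough illustration of the idea (not the linked implementation), a hash ring with virtual nodes might look like the sketch below. The class name `ConsistentHashRing`, the default of 100 virtual nodes per server, and the use of MD5 for placement are all assumptions made for this example.

```python
import bisect
import hashlib


class ConsistentHashRing:
    """Minimal hash ring with virtual nodes (illustrative sketch only)."""

    def __init__(self, nodes=(), vnodes=100):
        self.vnodes = vnodes      # virtual nodes per physical node (assumed default)
        self._ring = {}           # ring position -> physical node
        self._positions = []      # sorted list of ring positions
        for node in nodes:
            self.add_node(node)

    @staticmethod
    def _hash(key):
        # MD5 is used here only to spread keys evenly, not for security.
        return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)

    def add_node(self, node):
        # Place `vnodes` points on the ring for this physical node.
        for i in range(self.vnodes):
            pos = self._hash(f"{node}#{i}")
            if pos in self._ring:
                continue  # skip the (very unlikely) position collision
            self._ring[pos] = node
            bisect.insort(self._positions, pos)

    def remove_node(self, node):
        # Remove only the ring points that belong to this node.
        for i in range(self.vnodes):
            pos = self._hash(f"{node}#{i}")
            if self._ring.get(pos) == node:
                del self._ring[pos]
                self._positions.pop(bisect.bisect_left(self._positions, pos))

    def get_node(self, key):
        # Walk clockwise to the first virtual node at or after the key's hash,
        # wrapping around to the start of the ring if necessary.
        if not self._positions:
            raise LookupError("ring is empty")
        idx = bisect.bisect_left(self._positions, self._hash(key))
        return self._ring[self._positions[idx % len(self._positions)]]
```

The property worth noting is that removing a server only remaps the keys that server owned; keys owned by the remaining servers keep their placement.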
Load Balancing Algorithms
Java and Python implementations of 5 load balancing algorithms: Round Robin, Weighted Round Robin, Least Connections, Least Response Time, and IP Hash.
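For orientation, three of the listed strategies can be sketched in a few lines each. This is a simplified sketch, not the linked code: the class names, the `pick`/`acquire`/`release` method names, and the choice of SHA-256 for IP hashing are assumptions made for this example.

```python
import hashlib
import itertools


class RoundRobin:
    """Hand out servers in a fixed rotating order."""

    def __init__(self, servers):
        self._cycle = itertools.cycle(list(servers))

    def pick(self):
        return next(self._cycle)


class LeastConnections:
    """Send each request to the server with the fewest active connections."""

    def __init__(self, servers):
        self.active = {server: 0 for server in servers}

    def acquire(self):
        # Ties go to the server that was listed first.
        server = min(self.active, key=self.active.get)
        self.active[server] += 1
        return server

    def release(self, server):
        # Call when the request finishes so the count stays accurate.
        self.active[server] -= 1


class IPHash:
    """Map a client IP to a server so the same client lands on the same server."""

    def __init__(self, servers):
        self.servers = list(servers)

    def pick(self, client_ip):
        digest = hashlib.sha256(client_ip.encode("utf-8")).digest()
        return self.servers[int.from_bytes(digest[:8], "big") % len(self.servers)]
```

Round Robin ignores load entirely, Least Connections needs the caller to report when a request completes, and IP Hash trades even distribution for client stickiness.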
Rate Limiting Algorithms
Java and Python implementations of 5 rate limiting algorithms: Token Bucket, Leaky Bucket, Fixed Window, Sliding Window Log, and Sliding Window Counter.
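To make two of these concrete, here is a minimal single-threaded sketch of Token Bucket and Sliding Window Log. It is not the linked implementation: the class and parameter names are assumptions, and the injectable `clock` argument is added here only so the behaviour can be exercised without sleeping.

```python
import time
from collections import deque


class TokenBucket:
    """Allow bursts up to `capacity`, refilled at `refill_rate` tokens per second."""

    def __init__(self, capacity, refill_rate, clock=time.monotonic):
        self.capacity = capacity
        self.refill_rate = refill_rate
        self.clock = clock
        self.tokens = float(capacity)   # start with a full bucket
        self.last = clock()

    def allow(self, cost=1):
        # Refill based on elapsed time, capped at the bucket capacity.
        now = self.clock()
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.refill_rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


class SlidingWindowLog:
    """Allow at most `limit` requests in any trailing `window_seconds` interval."""

    def __init__(self, limit, window_seconds, clock=time.monotonic):
        self.limit = limit
        self.window = window_seconds
        self.clock = clock
        self.log = deque()              # timestamps of accepted requests

    def allow(self):
        now = self.clock()
        # Drop timestamps that have aged out of the window.
        while self.log and self.log[0] <= now - self.window:
            self.log.popleft()
        if len(self.log) < self.limit:
            self.log.append(now)
            return True
        return False
```

Token Bucket uses constant memory and tolerates bursts, while Sliding Window Log is exact but stores one timestamp per accepted request. A production version would also need locking or per-key state, which this sketch leaves out.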