SDMastery

Resources

30 curated resources — papers, case studies, and implementations.

Distributed Systems Papers

Paxos: The Part-Time Parliament

Lamport's foundational consensus algorithm that enables distributed systems to agree on a single value despite failures — the bedrock of modern...

7 min read

MapReduce: Simplified Data Processing on Large Clusters

Google's programming model for processing massive datasets in parallel across thousands of commodity machines — the paper that launched the big data era.

7 min read

The Google File System

Google's distributed file system designed for large-scale data-intensive applications — the blueprint for HDFS and modern distributed storage.

7 min read

Dynamo: Amazon's Highly Available Key-Value Store

Amazon's AP-leaning distributed key-value store that popularized consistent hashing, vector clocks, and sloppy quorums — the blueprint for Cassandra and...

8 min read

Kafka: A Distributed Messaging System for Log Processing

LinkedIn's distributed commit log that redefined event streaming — the foundation of modern real-time data pipelines.

8 min read

Spanner: Google's Globally Distributed Database

Google's globally distributed SQL database that uses GPS and atomic clocks (TrueTime) to achieve external consistency across continents.

7 min read

Bigtable: A Distributed Storage System for Structured Data

Google's wide-column store that introduced the tablet-based architecture and SSTable storage format — the design behind HBase and Cassandra's data model.

8 min read

ZooKeeper: Wait-Free Coordination for Internet-Scale Systems

Yahoo's coordination service that provides a simple file-system-like API for distributed synchronization — the backbone of Hadoop, Kafka, and HBase.

7 min read

The Log-Structured Merge-Tree (LSM-Tree)

O'Neil's write-optimized storage structure that converts random writes into sequential I/O — the engine inside Cassandra, RocksDB, and LevelDB.

7 min read

The Chubby Lock Service for Loosely-Coupled Distributed Systems

Google's Paxos-based distributed lock service that provides coarse-grained locking and reliable storage for small metadata — the inspiration for ZooKeeper.

7 min read

Engineering Case Studies

Discord Message Storage: From MongoDB to ScyllaDB

Discord's journey from MongoDB to Cassandra to ScyllaDB — how they scaled message storage for trillions of messages across millions of channels.

6 min read

Building In-Video Search at Netflix

How Netflix built a system to search within video content using computer vision, ML models, and temporal indexing for precise frame-level retrieval.

6 min read

How Canva Scaled Media Uploads from Zero to 50 Million Per Day

Canva's architecture evolution to handle 50 million daily media uploads — S3 storage, async processing pipelines, and thumbnail generation at scale.

7 min read

How Airbnb Avoids Double Payments in a Distributed System

Airbnb's idempotency framework for distributed payments — preventing double-charges across microservices with idempotency keys and state machines.

7 min read

Designing Stripe's Payments API for 10 Years of Evolution

How Stripe designed an API that evolved over a decade while maintaining backward compatibility — versioning strategies, error conventions, and design...

7 min read

Real-Time Messaging Architecture at Slack

How Slack delivers real-time messages to millions of concurrent users using WebSocket connections, message fanout, and channel-based routing.

7 min read

Uber's Schemaless: A Trip-Optimized Datastore

How Uber built Schemaless, a fault-tolerant, append-only datastore on top of MySQL that powers trip storage, handling millions of writes per second with...

7 min read

Twitter's Timeline Architecture

How Twitter delivers hundreds of millions of personalized home timelines per day using a hybrid fanout architecture that pre-computes timelines for most...

8 min read

How Instagram Scaled to 14 Million Users With 3 Engineers

The architectural decisions that allowed Instagram to scale from zero to 14 million users with a team of 3 engineers — PostgreSQL, Redis, Memcached, and a...

9 min read

Google Spanner and TrueTime

How Google Spanner uses TrueTime — GPS and atomic clock-synchronized timestamps — to provide globally consistent distributed transactions, solving the...

9 min read

Why LinkedIn Built Apache Kafka

The origin story of Apache Kafka — how LinkedIn's need to process billions of activity events per day led to the creation of a distributed commit log that...

9 min read

Code Implementations