SDMastery

30-Day System Design Roadmap

A structured 30-day plan to learn system design fundamentals. Covers core concepts, databases, caching, APIs, and basic interview problems.

This roadmap is designed for engineers with 2+ years of coding experience who want to build a solid system design foundation. Spend 1-2 hours per day.

Week 1: Core Foundations (Days 1-7)

Day | Topic | Time
1 | Scalability: vertical vs horizontal | 45 min
2 | Availability and Reliability | 45 min
3 | Latency vs Throughput vs Bandwidth | 45 min
4 | CAP Theorem | 60 min
5 | Single Points of Failure (SPOF) | 30 min
6 | Failover and Fault Tolerance | 45 min
7 | Review + Practice: explain each concept in your own words | 60 min

Week 2: Networking and APIs (Days 8-14)

Day | Topic | Time
8 | DNS and how it works | 45 min
9 | HTTP/HTTPS, TCP vs UDP | 45 min
10 | Load Balancing algorithms | 60 min
11 | What is an API, REST vs GraphQL | 45 min
12 | API Gateway and Rate Limiting | 45 min
13 | WebSockets and Webhooks | 45 min
14 | Review + Practice: Design a simple URL shortener | 90 min

Week 3: Databases and Caching (Days 15-21)

Day | Topic | Time
15 | SQL vs NoSQL databases | 60 min
16 | ACID transactions and indexes | 45 min
17 | Database sharding and replication | 60 min
18 | Caching strategies and eviction policies | 60 min
19 | Distributed caching and CDN | 45 min
20 | Message queues and Pub/Sub | 60 min
21 | Review + Practice: Design a distributed cache | 90 min

Week 4: Interview Practice (Days 22-30)

Day | Topic | Time
22 | System design interview framework | 60 min
23 | Practice: Design a URL Shortener (full solution) | 90 min
24 | Practice: Design a Load Balancer | 90 min
25 | Practice: Design a Notification Service | 90 min
26 | Practice: Design WhatsApp | 90 min
27 | Tradeoffs deep dive: consistency vs availability | 60 min
28 | Tradeoffs deep dive: batch vs stream processing | 60 min
29 | Practice: Design Netflix (full solution) | 90 min
30 | Review everything, mock interview with a friend | 120 min

Tips for Success

  • Do not just read — explain each concept out loud or write a summary
  • Draw architecture diagrams by hand for every practice problem
  • Focus on tradeoffs, not just the "right" answer
  • Use this site's concept pages for each topic listed above

Week 1: Fundamentals and Core Concepts

The first week is about building the vocabulary and mental models that everything else depends on. You cannot design a distributed system if you do not understand why distributed systems are hard in the first place.

Scalability is the starting point. Learn the difference between vertical scaling (bigger machines) and horizontal scaling (more machines). Vertical scaling is simpler but has a ceiling. Horizontal scaling is harder to implement but has no theoretical limit. Every system you design in an interview will need horizontal scaling at some point, so understand the implications: stateless services, shared-nothing architectures, and the need for load balancing. Read more about scalability.

Availability and reliability are often confused. Availability is the percentage of time a system is operational, usually expressed in "nines": 99.9% allows about 8.76 hours of downtime per year. Reliability is the probability that a system performs correctly for a given period. A system can be highly available but unreliable if it serves incorrect results. Learn about redundancy, failover, and how companies like Netflix use Chaos Monkey to test availability in production. Understand availability.
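The "nines" arithmetic is worth being able to reproduce on a whiteboard. A minimal sketch, assuming a 365-day year and a hypothetical helper named `downtime_hours_per_year`:

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours, ignoring leap years


def downtime_hours_per_year(availability_percent: float) -> float:
    """Return the yearly downtime budget implied by an availability target."""
    unavailable_fraction = 1 - availability_percent / 100
    return HOURS_PER_YEAR * unavailable_fraction


# Print the downtime budget for a few common targets.
for target in (99.0, 99.9, 99.99, 99.999):
    hours = downtime_hours_per_year(target)
    print(f"{target}% -> {hours:.2f} hours/year ({hours * 60:.1f} minutes)")
```

Each extra nine cuts the budget by a factor of ten, which is why moving from 99.9% to 99.99% is usually a redundancy and failover question rather than a tuning exercise.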

CAP theorem is the most frequently tested theoretical concept. At minimum, you need to explain the three guarantees (consistency, availability, partition tolerance), why a system must give up either consistency or availability for as long as a network partition lasts, and name real systems that make each choice. DynamoDB is usually classified as AP (its default reads are eventually consistent), HBase as CP, and Google Spanner is technically CP, although Google describes it as "effectively CA" because its private network partitions so rarely. Deep dive into CAP theorem.

Latency vs throughput is the tradeoff you will reference in every single design. Latency is how long one request takes. Throughput is how many requests you handle per second. Optimizing for one often hurts the other. Batching improves throughput but increases latency. Caching improves both but introduces consistency challenges. Learn the latency numbers every engineer should know: L1 cache reference is 0.5ns, a round trip within the same data center is about 500 microseconds, and a cross-continent round trip is roughly 150ms. Explore latency vs throughput.
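The batching tradeoff can be made concrete with a toy model. This sketch assumes a single worker that pays a fixed overhead per batch plus a per-item cost; the 10 ms and 1 ms figures are invented for illustration, and the model ignores the time items spend waiting for a batch to fill.

```python
def batch_stats(batch_size: int, overhead_ms: float = 10.0, per_item_ms: float = 1.0):
    """Return (batch latency in ms, throughput in items per second)."""
    batch_time_ms = overhead_ms + batch_size * per_item_ms
    throughput_per_s = batch_size / batch_time_ms * 1000
    return batch_time_ms, throughput_per_s


for size in (1, 10, 100):
    latency, throughput = batch_stats(size)
    print(f"batch={size:>3}  latency={latency:.0f} ms  throughput={throughput:.0f} items/s")
```

Under these assumptions, going from a batch of 1 to a batch of 100 raises throughput roughly tenfold while each item's completion time also grows tenfold, which is the tradeoff described above.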

By the end of week 1, you should be able to explain each concept to a non-technical person and draw the basic diagrams from memory.

Week 2: Databases, Caching, and Storage

Week 2 is where you build the data layer knowledge that separates strong candidates from average ones. Every system design problem involves storing and retrieving data, so this week is high-leverage.

SQL vs NoSQL is not a religious debate — it is a tradeoff decision. SQL databases (PostgreSQL, MySQL) give you ACID transactions, complex joins, and a mature ecosystem. NoSQL databases (MongoDB, Cassandra, DynamoDB) give you flexible schemas, horizontal scaling, and better write throughput for specific access patterns. The answer in interviews is almost never "always use SQL" or "always use NoSQL." It depends on your access patterns, consistency requirements, and scale. Instagram uses PostgreSQL (SQL) for user data but might use Cassandra for analytics events. Compare SQL vs NoSQL.

Indexing and sharding are how you scale databases beyond a single machine. An index speeds up reads at the cost of slower writes and more storage. A B-tree index is the default for most databases. A hash index is faster for equality lookups but cannot do range queries. When a single database cannot handle the load, you shard: split data across multiple database instances based on a shard key. Instagram shards by user_id, and Discord partitions messages by channel_id plus a time bucket. The shard key decision is expensive to reverse once data has been distributed, so choose carefully. Learn about database sharding.
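Routing by shard key can be sketched in a few lines. This is an illustrative hash-modulo scheme, not a description of how Instagram or Discord implement sharding; `NUM_SHARDS` and `shard_for` are hypothetical names.

```python
import hashlib

NUM_SHARDS = 8  # hypothetical shard count


def shard_for(user_id: int, num_shards: int = NUM_SHARDS) -> int:
    """Map a shard key to a shard index using a stable hash.

    A digest is used instead of Python's built-in hash() so that every
    process and every machine computes the same shard for the same key.
    """
    digest = hashlib.sha256(str(user_id).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


for user_id in (1, 2, 3, 1001):
    print(f"user {user_id} -> shard {shard_for(user_id)}")
```

With plain modulo routing, changing the shard count remaps most keys, which is one reason the shard key and routing scheme are so costly to change later.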

Caching strategies can reduce database load substantially, by 80% or more for read-heavy workloads with a high hit rate. Learn the four main patterns: cache-aside (application checks cache first, fills on miss), write-through (each write goes to the cache and the database synchronously before it is acknowledged), write-behind (writes go to cache first, database asynchronously), and read-through (the cache itself loads from the database on a miss). Each has different consistency and performance characteristics. Redis and Memcached are the two dominant caching technologies. Redis supports data structures (sorted sets, lists, hashes), while Memcached is simpler and can be faster for pure key-value caching. Study caching strategies.
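Cache-aside is the pattern most often sketched in interviews. The example below is a minimal in-process version, assuming a plain dictionary stands in for Redis or Memcached; `CacheAside`, `load_from_db`, and `write_to_db` are hypothetical names.

```python
import time


class CacheAside:
    def __init__(self, load_from_db, ttl_seconds: float = 60.0):
        self._load_from_db = load_from_db  # hypothetical database lookup function
        self._ttl = ttl_seconds
        self._cache = {}  # key -> (value, expires_at); stands in for a real cache

    def get(self, key):
        entry = self._cache.get(key)
        if entry is not None and entry[1] > time.monotonic():
            return entry[0]  # cache hit
        value = self._load_from_db(key)  # cache miss: read the source of truth
        self._cache[key] = (value, time.monotonic() + self._ttl)
        return value

    def update(self, key, value, write_to_db):
        write_to_db(key, value)     # write the database first
        self._cache.pop(key, None)  # then invalidate so the next read refills


database = {"user:1": "Ada"}
cache = CacheAside(database.__getitem__)
print(cache.get("user:1"))  # first call reads the database
print(cache.get("user:1"))  # second call is served from the cache
```

Invalidating on write rather than updating the cached value is one common choice; it trades an extra miss for a smaller window in which the cache and database disagree.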

CDN (Content Delivery Network) is a globally distributed cache for static and semi-static content. Cloudflare, Amazon CloudFront, and Akamai serve content from edge nodes close to users, which can cut latency from hundreds of milliseconds to as little as single-digit milliseconds. In interviews, mention CDN early for any system that serves images, videos, or static files. Netflix serves all video content through its Open Connect CDN, with servers placed directly inside ISP networks. Understand CDN.

Week 3: APIs, Networking, and Async Communication

Week 3 covers the communication layer — how services talk to each other and how clients talk to your system.

REST vs GraphQL is a common interview discussion point. REST is the default for most APIs: stateless, resource-based, uses HTTP methods. GraphQL lets clients request exactly the data they need, solving the over-fetching and under-fetching problems of REST. GitHub added a GraphQL API alongside its REST API partly because clients often needed several REST calls to assemble the data for a single page. The tradeoff: GraphQL adds complexity on the server side (query parsing, authorization per field, N+1 query problems) and makes HTTP caching harder because requests are typically POSTs whose bodies differ from one query to the next. For system design interviews, default to REST unless the problem specifically involves complex, nested data that clients consume in different ways. Read about API design.

Rate limiting protects your system from abuse and ensures fair usage. Learn the four main algorithms: fixed window (simple but allows bursts at window boundaries), sliding window (smoother but uses more memory), token bucket (allows controlled bursts, used by AWS and Stripe), and leaky bucket (smooths traffic to a constant rate). In interviews, rate limiting should be mentioned whenever you design a public-facing API. Implement it at the API gateway level using Redis to store counters. GitHub's REST API allows 5,000 requests per hour per authenticated user and reports a reset time in its response headers, which behaves like a fixed hourly window rather than a continuously refilling token bucket. Learn about rate limiting.
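Of the four algorithms, token bucket is the one interviewers most often ask candidates to sketch. The version below is a single-process illustration with state held in memory; a gateway deployment would keep the counters in a shared store such as Redis, as noted above. The `TokenBucket` class and its injectable `clock` parameter are assumptions made for this example.

```python
import time


class TokenBucket:
    def __init__(self, capacity: float, refill_per_second: float, clock=time.monotonic):
        self.capacity = capacity
        self.refill_per_second = refill_per_second
        self._clock = clock  # injectable so the behaviour can be tested without sleeping
        self._tokens = capacity  # start full, which permits an initial burst
        self._last = clock()

    def allow(self, cost: float = 1.0) -> bool:
        """Refill based on elapsed time, then spend tokens if enough are available."""
        now = self._clock()
        elapsed = now - self._last
        self._last = now
        self._tokens = min(self.capacity, self._tokens + elapsed * self.refill_per_second)
        if self._tokens >= cost:
            self._tokens -= cost
            return True
        return False


bucket = TokenBucket(capacity=5, refill_per_second=1)
print([bucket.allow() for _ in range(7)])  # the burst beyond capacity is rejected
```

The capacity sets how large a burst is tolerated, and the refill rate sets the sustained request rate; tuning the two independently is what distinguishes token bucket from a fixed window.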

Load balancing distributes traffic across multiple servers. Layer 4 (transport) load balancers route based on IP and port — they are fast but cannot make routing decisions based on request content. Layer 7 (application) load balancers can route based on URL path, headers, or cookies — more flexible but slightly slower. Learn the algorithms: round-robin, weighted round-robin, least connections, and consistent hashing. AWS ALB (Application Load Balancer) is Layer 7, NLB (Network Load Balancer) is Layer 4. In interviews, always mention load balancers when you have multiple instances of a service. Explore load balancing.
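Two of the algorithms named above, round-robin and least connections, are small enough to sketch directly. This is an illustration of the selection logic only, with hypothetical class names; real load balancers also handle health checks, weights, and concurrency.

```python
import itertools


class RoundRobin:
    """Hand out servers in a fixed rotation, regardless of their load."""

    def __init__(self, servers):
        self._cycle = itertools.cycle(servers)

    def pick(self):
        return next(self._cycle)


class LeastConnections:
    """Send each new request to the server with the fewest active connections."""

    def __init__(self, servers):
        self.active = {server: 0 for server in servers}

    def acquire(self):
        server = min(self.active, key=self.active.get)  # ties go to the earliest server
        self.active[server] += 1
        return server

    def release(self, server):
        self.active[server] -= 1


rr = RoundRobin(["app-1", "app-2", "app-3"])
print([rr.pick() for _ in range(5)])
```

Round-robin works well when requests cost about the same; least connections adapts better when some requests are long-lived, such as WebSocket sessions.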

Message queues enable asynchronous communication between services. Instead of service A calling service B directly (synchronous and tightly coupled), service A puts a message on a queue and service B processes it when ready. This decouples services, handles traffic spikes (the queue absorbs bursts), and improves reliability (messages persist if a consumer is down). Kafka is the go-to for high-throughput event streaming (LinkedIn has reported processing 7 trillion messages per day). RabbitMQ and SQS are better for traditional task queues. In interviews, use message queues whenever a task does not need an immediate response: sending emails, processing images, updating analytics. Study message queues.
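The producer and consumer roles can be illustrated with Python's standard library. This sketch uses an in-process `queue.Queue` as a stand-in for a broker such as Kafka, RabbitMQ, or SQS; unlike a real broker it does not persist messages, and the email task is simulated.

```python
import queue
import threading

tasks = queue.Queue()  # in-process stand-in for a message broker
processed = []


def worker():
    """Consumer: pull messages at its own pace until it sees the stop sentinel."""
    while True:
        message = tasks.get()
        if message is None:  # sentinel value used here to stop the worker
            tasks.task_done()
            break
        processed.append(f"sent email to {message['to']}")  # simulated work
        tasks.task_done()


consumer = threading.Thread(target=worker)
consumer.start()

# Producer: enqueue and move on without waiting for the work to finish.
for address in ("a@example.com", "b@example.com"):
    tasks.put({"to": address})

tasks.put(None)
tasks.join()      # wait until every message has been processed
consumer.join()
print(processed)
```

The producer's `put` returns immediately, so a slow consumer delays the emails but not the request that triggered them.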

Week 4: Practice Problems and Mock Interviews

Week 4 is where you synthesize everything from the previous three weeks by solving real system design problems. The gap between knowing concepts and applying them in a 45-minute interview is enormous, so deliberate practice is essential.

Start with easy problems. The URL shortener is the canonical starter problem because it touches all the fundamentals: API design (POST to create, GET to redirect), database choice (key-value store is sufficient), caching (URLs are read-heavy, cache popular ones), and scaling (hash-based sharding by short code). Practice this until you can walk through a complete design in 25 minutes, including estimation, API, data model, architecture, and tradeoffs.
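One common way to produce the short code is to base62-encode a numeric ID. The paragraph above does not prescribe an encoding, so treat this as one possible approach; the alphabet order and function names are assumptions.

```python
import string

# 10 digits + 26 lowercase + 26 uppercase = 62 URL-safe characters
ALPHABET = string.digits + string.ascii_lowercase + string.ascii_uppercase


def encode_base62(number: int) -> str:
    """Convert a non-negative integer ID into a compact base62 string."""
    if number < 0:
        raise ValueError("number must be non-negative")
    if number == 0:
        return ALPHABET[0]
    chars = []
    while number:
        number, remainder = divmod(number, 62)
        chars.append(ALPHABET[remainder])
    return "".join(reversed(chars))


def decode_base62(code: str) -> int:
    """Invert encode_base62."""
    number = 0
    for char in code:
        number = number * 62 + ALPHABET.index(char)
    return number


print(encode_base62(123456789))
print(62 ** 7)  # number of distinct 7-character codes
```

Seven characters give 62^7 distinct codes, about 3.5 trillion, which is the capacity figure to quote when estimating how long the keyspace will last.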

Move to medium problems. Design WhatsApp tests real-time communication, presence detection, message delivery guarantees, and group messaging. Design Twitter tests feed generation, fanout-on-write vs fanout-on-read, celebrity handling, and timeline caching. These problems require you to make real tradeoffs and justify them.

How to practice effectively:

  1. Set a 45-minute timer and simulate the real interview. No pausing to Google things.
  2. Start every problem with 3-5 clarifying questions. Interviewers expect this and it buys you time to think.
  3. Do back-of-the-envelope estimation before designing. If the system handles 10,000 QPS, your architecture looks very different from one handling 10 million QPS.
  4. Draw the architecture diagram first, then drill into specific components. Never start with database schema.
  5. Speak your tradeoffs out loud: "I am choosing eventual consistency here because showing a slightly stale feed is acceptable, and it lets us serve requests faster."
  6. Practice with a partner who can ask follow-up questions. Solo practice builds knowledge but does not build the skill of thinking on your feet.
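The estimation step in the list above is mostly multiplication and division. A minimal sketch, assuming invented inputs (10 million daily users, 20 requests each, 10% writes, 500 bytes per write, and a peak factor of 2):

```python
SECONDS_PER_DAY = 24 * 60 * 60


def estimate(daily_active_users, requests_per_user_per_day, writes_fraction,
             bytes_per_write, peak_factor=2.0):
    """Return rough QPS and storage figures from a handful of assumptions."""
    total_per_day = daily_active_users * requests_per_user_per_day
    avg_qps = total_per_day / SECONDS_PER_DAY
    writes_per_day = total_per_day * writes_fraction
    return {
        "avg_qps": avg_qps,
        "peak_qps": avg_qps * peak_factor,
        "storage_gb_per_year": writes_per_day * bytes_per_write * 365 / 1e9,
    }


numbers = estimate(10_000_000, 20, 0.1, 500)
for name, value in numbers.items():
    print(f"{name}: {value:,.0f}")
```

In an interview, round aggressively (86,400 seconds becomes 100,000) and state each assumption aloud; the goal is the right order of magnitude, not precision.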

Record yourself solving problems and watch the recordings. You will notice verbal tics, long pauses, and moments where you get stuck that you did not notice in real time.

Common Mistakes When Learning System Design

Avoiding these pitfalls will save you weeks of wasted study time.

Mistake 1: Memorizing solutions instead of understanding principles. If you memorize "URL shortener uses base62 encoding and Redis cache," you will freeze when the interviewer asks "what if we need to support custom aliases?" or "how would you handle 10x the traffic?" Understanding why base62 is chosen (compact, URL-safe, 62^7 gives 3.5 trillion possibilities) lets you adapt to any variation.

Mistake 2: Skipping back-of-the-envelope estimation. Estimation is not busywork — it directly informs your design. If you calculate 100 writes/second, a single PostgreSQL instance is fine. If you calculate 100,000 writes/second, you need sharding, write-behind caching, or a different database entirely. Companies like Google and Meta explicitly test estimation skills.

Mistake 3: Not practicing out loud. System design interviews are conversations, not written exams. Many engineers who can write a perfect design document freeze when they have to explain their thinking verbally in real time. Practice explaining your designs to a rubber duck, a friend, or a camera. The ability to articulate tradeoffs clearly is half the interview.

Mistake 4: Over-engineering from the start. Do not jump to microservices, Kubernetes, and event sourcing for a system that serves 1,000 users. Start with the simplest architecture that meets the requirements, then explain how you would evolve it as scale increases. This shows maturity.

Mistake 5: Ignoring non-functional requirements. Many candidates design a system that works but never discuss availability, latency targets, consistency models, or failure scenarios. Senior-level answers always address: "What happens when this component fails?" and "What is the expected latency for this operation?"

Mistake 6: Studying breadth without depth. Knowing 20 concepts superficially is less valuable than deeply understanding 10. If you mention consistent hashing in an interview, you should be able to explain how it works, why it minimizes data movement during resharding, and how virtual nodes solve the uneven distribution problem.
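As an example of the depth Mistake 6 asks for, here is a minimal sketch of consistent hashing with virtual nodes. It assumes a SHA-256 based position function and 100 virtual nodes per server; the class and method names are hypothetical.

```python
import bisect
import hashlib


def _hash(value: str) -> int:
    """Map a string to a stable position on the ring."""
    return int.from_bytes(hashlib.sha256(value.encode("utf-8")).digest()[:8], "big")


class ConsistentHashRing:
    def __init__(self, nodes=(), vnodes: int = 100):
        self._vnodes = vnodes
        self._ring = []  # sorted list of (position, node) pairs
        for node in nodes:
            self.add_node(node)

    def add_node(self, node: str) -> None:
        # Each server appears at many positions, which evens out the key distribution.
        for i in range(self._vnodes):
            bisect.insort(self._ring, (_hash(f"{node}#{i}"), node))

    def remove_node(self, node: str) -> None:
        self._ring = [entry for entry in self._ring if entry[1] != node]

    def get_node(self, key: str) -> str:
        if not self._ring:
            raise LookupError("ring is empty")
        # Find the first virtual node at or after the key, wrapping around at the end.
        index = bisect.bisect_left(self._ring, (_hash(key),))
        return self._ring[index % len(self._ring)][1]


ring = ConsistentHashRing(["cache-a", "cache-b", "cache-c"])
keys = [f"key{i}" for i in range(1000)]
before = {key: ring.get_node(key) for key in keys}
ring.add_node("cache-d")
moved = [key for key in keys if ring.get_node(key) != before[key]]
print(f"{len(moved)} of {len(keys)} keys moved after adding a node")
```

Only keys that now belong to the new node move, roughly a quarter of them when going from three nodes to four, whereas modulo-based routing would remap most keys.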

How to Know You Are Ready

Readiness is not about knowing everything — it is about having a reliable process and enough depth to handle follow-up questions.

You are ready when you can do all of these:

  • Given any common system design problem, you can produce a reasonable high-level architecture in 5 minutes without looking anything up
  • You can estimate QPS, storage, and bandwidth for a system and your numbers are within an order of magnitude of reality
  • You can name 2-3 real tradeoffs for every major design decision (database choice, consistency model, sync vs async)
  • When someone asks "what happens if X fails?", you have an answer that includes detection, mitigation, and recovery
  • You can explain caching, sharding, load balancing, and message queues with specific implementation details, not just definitions
  • You have practiced at least 5-8 problems end-to-end with a timer

What "good enough" looks like: You do not need to know every system design topic. You need to demonstrate structured thinking, clear communication, and the ability to make and justify tradeoffs. An interviewer would rather hear a well-reasoned design for a simpler system than a hand-wavy design for a complex one. If you can clearly explain why you chose PostgreSQL over Cassandra for a specific use case, citing access patterns and consistency requirements, you are demonstrating the thinking they want to see.

The biggest sign you are not ready: you cannot explain your design decisions without referring to notes. If you need to look up whether to use a message queue or direct API call, you need more practice. If you can explain the tradeoff from memory and apply it to the specific problem, you are ready.