hard8 min readUpdated 2026-06-08

Design a Web Crawler

Design a distributed web crawler with URL frontier, politeness policies, content deduplication, and distributed coordination.

Problem Statement

Design a distributed web crawler that systematically downloads web pages from the internet for indexing. The crawler must discover new URLs, respect robots.txt and rate limits (politeness), deduplicate content to avoid re-processing identical pages, prioritize important pages, and scale to crawl billions of pages per month.

Requirements

System architecture diagram for Design a Web Crawler showing how services, databases, and caches connect — System architecture for Design a Web Crawler

Functional

Crawl web pages starting from a seed set of URLs; extract and follow links to discover new pages
Respect robots.txt: honor crawl-delay directives and disallowed paths per domain
Deduplicate: detect and skip pages with identical or near-identical content (Simhash)
Prioritize: crawl high-value pages (news sites, popular domains) more frequently

Non-Functional

Throughput: 1B pages/month (~400 pages/second sustained)
Politeness: Max 1 request per domain per second (configurable per domain)
Freshness: Important pages re-crawled within hours; long-tail pages within weeks
Fault tolerance: Crawler nodes can fail without losing progress or re-crawling the same pages

Core Architecture

Step-by-step diagram showing how Design a Web Crawler processes a request from start to finish — How Design a Web Crawler works step by step

URL Frontier -- A priority queue of URLs to be crawled, partitioned by domain. Each domain has its own sub-queue with a scheduled "next allowed crawl time" enforcing politeness. The frontier uses a two-level priority: (a) front queue selects by priority (domain importance, freshness need), (b) back queue routes to the per-domain politeness queue. Backed by a combination of Redis (hot domains) and on-disk queue (long tail).
Fetcher Pool -- Distributed HTTP fetcher nodes. Each node pulls URLs from the frontier, respects the domain's crawl delay, downloads the page (with timeout and max size limits), and passes the raw HTML to the processing pipeline. Uses consistent hashing to assign domains to fetcher nodes -- ensuring one node handles one domain (prevents politeness violations from concurrent fetchers).
Content Processor -- Parses downloaded HTML: extracts text content, title, metadata, and outgoing links. Computes a Simhash (locality-sensitive hash) of the page content. Compares against a Simhash index to detect near-duplicates (Hamming distance <3 = duplicate). Non-duplicate pages are sent to the indexer; extracted URLs are sent back to the URL frontier.

Data flow diagram for Design a Web Crawler showing how requests and responses move through the system — Data flow through Design a Web Crawler

URL Deduplication Store -- A Bloom filter (or Cuckoo filter) containing all previously-seen URLs. Before adding a URL to the frontier, check the filter. False positive rate of 1% is acceptable (worst case: we skip a valid URL, it will be discovered via another link). At 10B URLs, a Bloom filter requires ~1.2 GB of memory.

Database Choice

Redis for the hot portion of the URL frontier (active domains being crawled) and robots.txt cache (per domain, TTL 24 hours). RocksDB (on-disk) for the long-tail URL frontier and the Simhash index (10B entries). Kafka for the pipeline: fetcher -> processor -> indexer, with URLs-to-crawl as a separate topic flowing back to the frontier. S3 for raw HTML storage (archive of crawled pages).

Interview preparation checklist for Design a Web Crawler with key points to mention and mistakes to avoid — Interview tips for Design a Web Crawler

Key API Endpoints

text

POST /api/v1/crawl/seeds
  -> Body: \{ urls: ["https://example.com", "https://news.site.com"], priority: "high" \}
  -> Returns: \{ accepted: 2, already_known: 0 \}

GET /api/v1/crawl/status
  -> Returns: \{ urls_in_frontier: 500M, pages_crawled_today: 35M, active_fetchers: 200, domains_active: 50K \}

GET /api/v1/crawl/domain/\{domain\}/stats
  -> Returns: \{ pages_crawled: 15000, last_crawl: "...", crawl_delay_sec: 2, robots_txt_cached: true \}

Scaling Insight

Domain-based partitioning with consistent hashing is the linchpin of the architecture. By assigning each domain to exactly one fetcher node (via consistent hashing of the domain name), you solve two problems simultaneously: (1) politeness is naturally enforced because only one node sends requests to each domain, and (2) robots.txt caching is efficient because each node only needs to cache robots.txt for its assigned domains. When a fetcher node fails, its domains are reassigned to other nodes via the hash ring, and crawling resumes from the frontier.

Decision guide for when to choose Design a Web Crawler and when alternative approaches are better — When to use Design a Web Crawler

Key Tradeoffs

Decision	Option A	Option B	Chosen
URL dedup	Hash set in database (exact)	Bloom filter (probabilistic)	Bloom filter -- 1000x less memory (1.2 GB vs 1.2 TB), 1% false positive acceptable
Content dedup	Exact hash (MD5)	Simhash (near-duplicate)	Simhash -- catches mirror sites and minor variations, much better dedup quality
Frontier storage	All in Redis (fast)	Redis + RocksDB (hybrid)	Hybrid -- Redis for hot domains, RocksDB for 500M+ long-tail URLs that do not fit in memory

Practical Implementation for .NET Developers

In a .NET application, you would typically implement this pattern using the following approach:

Tradeoff analysis for Design a Web Crawler listing advantages, disadvantages, and real-world considerations — Advantages and disadvantages of Design a Web Crawler

ASP.NET Core setup: Create a service class that encapsulates the logic, register it with dependency injection, and inject it into your controllers or minimal API endpoints. The built-in DI container handles lifecycle management.

Entity Framework Core: For database interactions, EF Core provides the ORM layer. Use migrations for schema management and raw SQL for performance-critical queries. Consider Dapper for read-heavy paths where EF Core's overhead matters.

Azure integration: If deploying to Azure, leverage managed services — Azure Cache for Redis, Azure SQL, Azure Service Bus, Azure Cosmos DB. These eliminate operational overhead and provide built-in monitoring through Application Insights.

Production deployment examples of Design a Web Crawler at companies like Netflix, Google, and Amazon — Real-world examples of Design a Web Crawler

Testing: Use xUnit with Testcontainers for integration tests that spin up real databases in Docker. Mock external dependencies with NSubstitute. The WebApplicationFactory class lets you test your entire HTTP pipeline in-process.

Monitoring: Add Application Insights telemetry to track request latency, dependency calls, and custom metrics. Use structured logging with Serilog to make production debugging possible:

text

Log.Information("Processing order {OrderId} for {CustomerId}", orderId, customerId);

This gives you searchable, structured logs in Azure Monitor or Seq.

Comparison table for Design a Web Crawler contrasting approaches, tradeoffs, and when to use each — Comparing key aspects of Design a Web Crawler

System-Specific Clarifying Questions

Before designing Web Crawler, ask questions specific to THIS system:

Component diagram for Design a Web Crawler showing each building block and its responsibility — Key components of Design a Web Crawler

Who are the primary users? Understanding the user base shapes every technical decision — consumer apps have different requirements than enterprise B2B systems.
What is the read-to-write ratio? This determines whether you optimize for fast reads (caching, denormalization) or fast writes (write-ahead logs, async processing).
What is the geographic distribution? Users in one country vs. global users fundamentally changes your data replication and CDN strategy.
What is the acceptable latency? Some features need sub-100ms responses, others can tolerate seconds. This determines your caching and architecture strategy.
What is the consistency requirement? Some data (payments, inventory) needs strong consistency. Other data (social feeds, recommendations) can be eventually consistent.

Architecture Deep Dive

The architecture for Web Crawler should be designed around the specific access patterns of the system. Do not apply generic templates — every system has unique hotspots, bottlenecks, and scaling challenges.

Write Path: How does data enter the system? Is it bursty (event-driven, flash sales) or steady (sensor data, logs)? Bursty writes need queuing and backpressure. Steady writes can go directly to the database.

Read Path: How is data consumed? Is it fan-out (one write, many reads like social feeds) or point lookups (one read for specific data like user profiles)? Fan-out reads benefit from pre-computation and caching. Point lookups benefit from efficient indexing.

Hot Spots: Where are the bottlenecks? For Web Crawler, identify the component that will fail first under load and design mitigation strategies: caching, sharding, rate limiting, or async processing.

Sources

Design a Web Crawler -- Reference
Source: System-Design-Overview

Reference

Reference Solutionvideo