Design a Web Crawler
Design a distributed web crawler with URL frontier, politeness policies, content deduplication, and distributed coordination.
Problem Statement
Design a distributed web crawler that systematically downloads web pages from the internet for indexing. The crawler must discover new URLs, respect robots.txt and rate limits (politeness), deduplicate content to avoid re-processing identical pages, prioritize important pages, and scale to crawl billions of pages per month.
Requirements
Functional
- Crawl web pages starting from a seed set of URLs; extract and follow links to discover new pages
- Respect robots.txt: honor crawl-delay directives and disallowed paths per domain
- Deduplicate: detect and skip pages with identical or near-identical content (Simhash)
- Prioritize: crawl high-value pages (news sites, popular domains) more frequently
Non-Functional
- Throughput: 1B pages/month (~400 pages/second sustained)
- Politeness: Max 1 request per domain per second (configurable per domain)
- Freshness: Important pages re-crawled within hours; long-tail pages within weeks
- Fault tolerance: Crawler nodes can fail without losing progress or re-crawling the same pages
Core Architecture
-
URL Frontier -- A priority queue of URLs to be crawled, partitioned by domain. Each domain has its own sub-queue with a scheduled "next allowed crawl time" enforcing politeness. The frontier uses a two-level priority: (a) front queue selects by priority (domain importance, freshness need), (b) back queue routes to the per-domain politeness queue. Backed by a combination of Redis (hot domains) and on-disk queue (long tail).
-
Fetcher Pool -- Distributed HTTP fetcher nodes. Each node pulls URLs from the frontier, respects the domain's crawl delay, downloads the page (with timeout and max size limits), and passes the raw HTML to the processing pipeline. Uses consistent hashing to assign domains to fetcher nodes -- ensuring one node handles one domain (prevents politeness violations from concurrent fetchers).
-
Content Processor -- Parses downloaded HTML: extracts text content, title, metadata, and outgoing links. Computes a Simhash (locality-sensitive hash) of the page content. Compares against a Simhash index to detect near-duplicates (Hamming distance <3 = duplicate). Non-duplicate pages are sent to the indexer; extracted URLs are sent back to the URL frontier.
- URL Deduplication Store -- A Bloom filter (or Cuckoo filter) containing all previously-seen URLs. Before adding a URL to the frontier, check the filter. False positive rate of 1% is acceptable (worst case: we skip a valid URL, it will be discovered via another link). At 10B URLs, a Bloom filter requires ~1.2 GB of memory.
Database Choice
Redis for the hot portion of the URL frontier (active domains being crawled) and robots.txt cache (per domain, TTL 24 hours). RocksDB (on-disk) for the long-tail URL frontier and the Simhash index (10B entries). Kafka for the pipeline: fetcher -> processor -> indexer, with URLs-to-crawl as a separate topic flowing back to the frontier. S3 for raw HTML storage (archive of crawled pages).
Key API Endpoints
POST /api/v1/crawl/seeds
-> Body: \{ urls: ["https://example.com", "https://news.site.com"], priority: "high" \}
-> Returns: \{ accepted: 2, already_known: 0 \}
GET /api/v1/crawl/status
-> Returns: \{ urls_in_frontier: 500M, pages_crawled_today: 35M, active_fetchers: 200, domains_active: 50K \}
GET /api/v1/crawl/domain/\{domain\}/stats
-> Returns: \{ pages_crawled: 15000, last_crawl: "...", crawl_delay_sec: 2, robots_txt_cached: true \}
Scaling Insight
Domain-based partitioning with consistent hashing is the linchpin of the architecture. By assigning each domain to exactly one fetcher node (via consistent hashing of the domain name), you solve two problems simultaneously: (1) politeness is naturally enforced because only one node sends requests to each domain, and (2) robots.txt caching is efficient because each node only needs to cache robots.txt for its assigned domains. When a fetcher node fails, its domains are reassigned to other nodes via the hash ring, and crawling resumes from the frontier.
Key Tradeoffs
| Decision | Option A | Option B | Chosen |
|---|---|---|---|
| URL dedup | Hash set in database (exact) | Bloom filter (probabilistic) | Bloom filter -- 1000x less memory (1.2 GB vs 1.2 TB), 1% false positive acceptable |
| Content dedup | Exact hash (MD5) | Simhash (near-duplicate) | Simhash -- catches mirror sites and minor variations, much better dedup quality |
| Frontier storage | All in Redis (fast) | Redis + RocksDB (hybrid) | Hybrid -- Redis for hot domains, RocksDB for 500M+ long-tail URLs that do not fit in memory |
Practical Implementation for .NET Developers
In a .NET application, you would typically implement this pattern using the following approach:
ASP.NET Core setup: Create a service class that encapsulates the logic, register it with dependency injection, and inject it into your controllers or minimal API endpoints. The built-in DI container handles lifecycle management.
Entity Framework Core: For database interactions, EF Core provides the ORM layer. Use migrations for schema management and raw SQL for performance-critical queries. Consider Dapper for read-heavy paths where EF Core's overhead matters.
Azure integration: If deploying to Azure, leverage managed services — Azure Cache for Redis, Azure SQL, Azure Service Bus, Azure Cosmos DB. These eliminate operational overhead and provide built-in monitoring through Application Insights.
Testing: Use xUnit with Testcontainers for integration tests that spin up real databases in Docker. Mock external dependencies with NSubstitute. The WebApplicationFactory class lets you test your entire HTTP pipeline in-process.
Monitoring: Add Application Insights telemetry to track request latency, dependency calls, and custom metrics. Use structured logging with Serilog to make production debugging possible:
Log.Information("Processing order {OrderId} for {CustomerId}", orderId, customerId);
This gives you searchable, structured logs in Azure Monitor or Seq.
System-Specific Clarifying Questions
Before designing Web Crawler, ask questions specific to THIS system:
- Who are the primary users? Understanding the user base shapes every technical decision — consumer apps have different requirements than enterprise B2B systems.
- What is the read-to-write ratio? This determines whether you optimize for fast reads (caching, denormalization) or fast writes (write-ahead logs, async processing).
- What is the geographic distribution? Users in one country vs. global users fundamentally changes your data replication and CDN strategy.
- What is the acceptable latency? Some features need sub-100ms responses, others can tolerate seconds. This determines your caching and architecture strategy.
- What is the consistency requirement? Some data (payments, inventory) needs strong consistency. Other data (social feeds, recommendations) can be eventually consistent.
Architecture Deep Dive
The architecture for Web Crawler should be designed around the specific access patterns of the system. Do not apply generic templates — every system has unique hotspots, bottlenecks, and scaling challenges.
Write Path: How does data enter the system? Is it bursty (event-driven, flash sales) or steady (sensor data, logs)? Bursty writes need queuing and backpressure. Steady writes can go directly to the database.
Read Path: How is data consumed? Is it fan-out (one write, many reads like social feeds) or point lookups (one read for specific data like user profiles)? Fan-out reads benefit from pre-computation and caching. Point lookups benefit from efficient indexing.
Hot Spots: Where are the bottlenecks? For Web Crawler, identify the component that will fail first under load and design mitigation strategies: caching, sharding, rate limiting, or async processing.