SDMastery
Difficulty: hard | 10 min read | Updated 2026-06-03

Design Dropbox

Design Dropbox with file sync, chunking, deduplication, conflict resolution, and delta sync. Covers the sync protocol and metadata management.

Problem Statement

Design a cloud file storage and synchronization service like Dropbox. Users store files in the cloud, sync them across multiple devices in near real-time, share files/folders with others, and access file version history. The system must handle large files efficiently using chunking, minimize bandwidth with delta sync, and resolve conflicts when the same file is edited on two devices offline.

Requirements

Functional

  • Upload/download files; sync file changes across all connected devices in near real-time
  • Chunk large files (4 MB chunks) for resumable uploads and efficient delta sync
  • Content-based deduplication: identical file chunks stored only once across all users
  • Conflict resolution: when the same file is edited on two offline devices, create a conflict copy

Non-Functional

  • Sync latency: Changes propagated to online devices within 5 seconds
  • Storage efficiency: Deduplication reduces storage by 50%+ across all users
  • Scale: 700M users, 500B files, 1.2 exabytes of data
  • Reliability: No data loss -- files replicated across 3+ data centers

Core Architecture

  1. Chunking Engine (Client-side) -- Splits files into chunks of roughly 4 MB using content-defined chunking (Rabin fingerprinting), so actual chunk sizes vary around that target. Because boundaries follow content rather than fixed offsets, inserting 1 byte at the start of a 1 GB file changes only 1-2 chunks, not all of them. Each chunk is hashed (SHA-256). Before uploading, the client sends the chunk hashes to the server, and only the chunks the server is missing are uploaded (delta sync).

  2. Metadata Service -- Stores the file tree: files, folders, versions, and the mapping from files to ordered lists of chunk hashes. Uses PostgreSQL sharded by user_id. Each file edit creates a new version entry pointing to the new set of chunk hashes. Provides the "diff" API: given a client's known version, return all changes since.

  3. Block Storage Service -- Stores raw chunk data in S3/GCS, keyed by chunk hash (SHA-256). Content-addressable storage means identical chunks are naturally deduplicated -- if 1M users have the same PDF, it is stored once. Chunks are encrypted at rest (AES-256) with keys managed by a KMS. Note the tension with per-user keys: encrypting the same chunk under different user keys yields different ciphertext, so the design must either use service-managed (or content-derived) chunk keys or give up cross-user deduplication.

  4. Sync Service -- Maintains a long-polling or WebSocket connection per online client. When one device uploads changes, the metadata service publishes an event. The sync service notifies all other devices of the same user to pull the updated file metadata and download any new chunks.

  5. Conflict Resolver -- If two devices edit the same file while offline, both upload their changes with the same parent version. The server detects the conflict (two writes with the same parent version), keeps the first one to sync as the primary, and saves the other as "filename (conflicted copy - Device - Date)". The user then resolves the conflict manually.
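
The conflict rule in the last step can be sketched as an optimistic version check. This is a minimal in-memory illustration, not the service's actual implementation: `MetadataStore`, `FileVersion`, and `conflict_copy_path` are hypothetical names, and placing the suffix before the file extension is an assumption. A real metadata service would perform the check-and-write atomically inside a database transaction.

```python
import posixpath
from dataclasses import dataclass, field


@dataclass
class FileVersion:
    version: int
    chunk_hashes: list[str] = field(default_factory=list)


def conflict_copy_path(path: str, device: str, date: str) -> str:
    # Assumption: the "(conflicted copy ...)" suffix goes before the extension.
    stem, ext = posixpath.splitext(path)
    return f"{stem} (conflicted copy - {device} - {date}){ext}"


class MetadataStore:
    """Hypothetical in-memory stand-in for the metadata service."""

    def __init__(self) -> None:
        self.files: dict[str, FileVersion] = {}

    def commit(self, path: str, parent_version: int, chunk_hashes: list[str],
               device: str, date: str) -> str:
        """Store an edit and return the path it was saved under."""
        current = self.files.get(path)
        current_version = current.version if current else 0
        if parent_version == current_version:
            # The edit is based on the latest version: accept it as the primary.
            self.files[path] = FileVersion(current_version + 1, list(chunk_hashes))
            return path
        # Stale parent: another device synced first, so keep both versions.
        copy_path = conflict_copy_path(path, device, date)
        self.files[copy_path] = FileVersion(1, list(chunk_hashes))
        return copy_path


store = MetadataStore()
store.commit("/docs/report.pdf", 0, ["aaa"], "phone", "2026-06-03")
# Both devices edit version 1 while offline; the phone syncs first.
print(store.commit("/docs/report.pdf", 1, ["bbb"], "phone", "2026-06-03"))
print(store.commit("/docs/report.pdf", 1, ["ccc"], "laptop", "2026-06-03"))
```

The second commit carries a parent version that is no longer current, so it is stored under the conflict-copy name while the primary file keeps the first writer's chunks.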

Database Choice

PostgreSQL (sharded by user_id) for file metadata, folder structure, sharing permissions, and version history. Sharding by user_id ensures all of a user's files are on the same shard for fast tree queries. S3/GCS for chunk blob storage (content-addressed by SHA-256 hash). Redis for online presence (which devices are connected) and change notification pub/sub. Kafka for change events between metadata service and sync service.

Key API Endpoints

```text
POST /api/v1/files/upload_session
  -> Body: { path: "/docs/report.pdf", chunk_hashes: ["abc123...", "def456..."] }
  -> Returns: { upload_id: "UP-789", chunks_needed: ["def456..."] } (only missing chunks)

PUT /api/v1/files/upload_session/{upload_id}/chunk
  -> Body: <binary chunk data>
  -> Headers: X-Chunk-Hash: def456...

GET /api/v1/files/changes?cursor={version_id}
  -> Returns: { changes: [{ path: "/docs/report.pdf", action: "modified", version: 42, chunk_hashes: [...] }], new_cursor: "..." }
```
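
The upload negotiation in the first two endpoints can be sketched as follows. This is an in-memory stand-in, not a real server: `UploadServer`, `upload_file`, and the `UP-<n>` id format are hypothetical, and the tiny 4-byte "chunks" are only there to keep the example readable.

```python
import hashlib
import itertools


def sha256_hex(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


class UploadServer:
    """Hypothetical in-memory stand-in for the upload endpoints."""

    def __init__(self) -> None:
        self.chunks: dict[str, bytes] = {}     # chunk hash -> chunk bytes
        self.sessions: dict[str, dict] = {}    # upload_id -> session info
        self._ids = itertools.count(1)

    def start_upload_session(self, path: str, chunk_hashes: list[str]) -> dict:
        # Report each missing hash once, preserving file order.
        needed = [h for h in dict.fromkeys(chunk_hashes) if h not in self.chunks]
        upload_id = f"UP-{next(self._ids)}"
        self.sessions[upload_id] = {"path": path, "chunk_hashes": list(chunk_hashes)}
        return {"upload_id": upload_id, "chunks_needed": needed}

    def put_chunk(self, upload_id: str, chunk_hash: str, data: bytes) -> None:
        if upload_id not in self.sessions:
            raise KeyError(f"unknown upload session {upload_id}")
        # Check the body against the X-Chunk-Hash header before storing it.
        if sha256_hex(data) != chunk_hash:
            raise ValueError("chunk body does not match X-Chunk-Hash")
        self.chunks[chunk_hash] = data


def upload_file(server: UploadServer, path: str, chunks: list[bytes]) -> int:
    """Upload a file given as a list of chunks; return the bytes actually sent."""
    by_hash = {sha256_hex(c): c for c in chunks}
    session = server.start_upload_session(path, [sha256_hex(c) for c in chunks])
    sent = 0
    for chunk_hash in session["chunks_needed"]:
        server.put_chunk(session["upload_id"], chunk_hash, by_hash[chunk_hash])
        sent += len(by_hash[chunk_hash])
    return sent


server = UploadServer()
v1 = [b"A" * 4, b"B" * 4, b"C" * 4]
v2 = [b"A" * 4, b"X" * 4, b"C" * 4]   # only the middle chunk changed
print(upload_file(server, "/docs/report.pdf", v1))  # 12: every chunk is new
print(upload_file(server, "/docs/report.pdf", v2))  # 4: only the changed chunk
```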

Scaling Insight

Content-defined chunking with deduplication is the most impactful optimization. Rabin fingerprinting sets chunk boundaries based on content (not fixed offsets), so inserting data at the start of a file only affects the first 1-2 chunks. Combined with content-addressed storage (SHA-256 hash as key), identical chunks across all users are stored once. Using the figures in this design, that is an estimated reduction from 1.2 EB to ~500 PB, which at this scale would save on the order of hundreds of millions of dollars in storage costs annually.
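
To make the boundary-shifting argument concrete, here is a small sketch that compares content-defined and fixed-size chunking after a one-byte insertion. It swaps in a simple gear-style rolling hash rather than true Rabin fingerprinting, and the parameters (`MIN_SIZE`, `MAX_SIZE`, `AVG_BITS`, the 96-byte fixed size) are hypothetical values scaled far below 4 MB so the effect is visible on 20 KB of data.

```python
import hashlib
import random

# Hypothetical demo parameters; a real client would target ~4 MB chunks.
MIN_SIZE = 32      # never cut before this many bytes
MAX_SIZE = 1024    # always cut by this many bytes
AVG_BITS = 6       # cut when the top 6 hash bits are zero (about 1 in 64)

_table_rng = random.Random(1234)
GEAR = [_table_rng.getrandbits(32) for _ in range(256)]  # one random word per byte value


def cdc_chunks(data: bytes) -> list[bytes]:
    """Split data at content-defined boundaries using a gear rolling hash."""
    chunks, start, h = [], 0, 0
    for i, byte in enumerate(data):
        # Shifting left ages out old input: only the last 32 bytes affect h.
        h = ((h << 1) + GEAR[byte]) & 0xFFFFFFFF
        size = i - start + 1
        at_boundary = size >= MIN_SIZE and (h >> (32 - AVG_BITS)) == 0
        if at_boundary or size >= MAX_SIZE:
            chunks.append(data[start:i + 1])
            start, h = i + 1, 0
    if start < len(data):
        chunks.append(data[start:])
    return chunks


def fixed_chunks(data: bytes, size: int = 96) -> list[bytes]:
    return [data[i:i + size] for i in range(0, len(data), size)]


def shared_chunk_count(old: list[bytes], new: list[bytes]) -> int:
    """Count chunks of `new` whose SHA-256 already appears in `old`."""
    old_hashes = {hashlib.sha256(c).hexdigest() for c in old}
    return sum(hashlib.sha256(c).hexdigest() in old_hashes for c in new)


data_rng = random.Random(7)
original = bytes(data_rng.getrandbits(8) for _ in range(20_000))
edited = b"!" + original   # insert one byte at the very start

cdc_new = cdc_chunks(edited)
fixed_new = fixed_chunks(edited)
print("content-defined:", shared_chunk_count(cdc_chunks(original), cdc_new),
      "of", len(cdc_new), "chunks reused")
print("fixed-size:     ", shared_chunk_count(fixed_chunks(original), fixed_new),
      "of", len(fixed_new), "chunks reused")
```

With fixed offsets every chunk after the insertion point shifts by one byte and hashes differently, whereas the content-defined boundaries realign after the edit, so most chunk hashes are unchanged and do not need to be re-uploaded.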

Key Tradeoffs

| Decision | Option A | Option B | Chosen |
| --- | --- | --- | --- |
| Chunking | Fixed-size (simple) | Content-defined (Rabin fingerprint) | Content-defined -- minimizes re-upload on edits, better deduplication across files |
| Sync protocol | Poll for changes (simple) | Long-poll / WebSocket (real-time) | Long-poll -- near-instant sync, lower server load than polling |
| Conflict resolution | Last-write-wins (data loss risk) | Conflict copy (user decides) | Conflict copy -- preserves both versions, no data loss |

Practical Implementation for .NET Developers

In a .NET application, you would typically implement this pattern using the following approach:

ASP.NET Core setup: Create a service class that encapsulates the logic, register it with dependency injection, and inject it into your controllers or minimal API endpoints. The built-in DI container handles lifecycle management.

Entity Framework Core: For database interactions, EF Core provides the ORM layer. Use migrations for schema management and raw SQL for performance-critical queries. Consider Dapper for read-heavy paths where EF Core's overhead matters.

Azure integration: If deploying to Azure, leverage managed services — Azure Cache for Redis, Azure SQL, Azure Service Bus, Azure Cosmos DB. These eliminate operational overhead and provide built-in monitoring through Application Insights.

Testing: Use xUnit with Testcontainers for integration tests that spin up real databases in Docker. Mock external dependencies with NSubstitute. The WebApplicationFactory class lets you test your entire HTTP pipeline in-process.

Monitoring: Add Application Insights telemetry to track request latency, dependency calls, and custom metrics. Use structured logging with Serilog to make production debugging possible:

```csharp
Log.Information("Processing order {OrderId} for {CustomerId}", orderId, customerId);
```

This gives you searchable, structured logs in Azure Monitor or Seq.

Deep-Dive: Clarifying Questions for Dropbox

  1. How does file sync work? When a user modifies a file on one device, how quickly does it appear on other devices? Dropbox targets under 5 seconds for small files.
  2. Do we need file chunking? Large files (1 GB+) should be split into chunks (typically 4 MB) so that modifying one part of a large file only uploads the changed chunks, not the entire file.
  3. How do we handle sync conflicts? The same file may be edited on two devices (or by two users) before either change has synced. Dropbox creates a "conflicted copy" rather than silently overwriting.
  4. Do we need deduplication? If 1,000 users upload the same 1 GB file, we should store it once. Content-addressable storage (hash the file content, use the hash as the key) enables this.
  5. What about bandwidth optimization? Users on slow connections should still be able to sync. Delta sync (uploading only the chunks that changed) is critical.
  6. How do we handle versioning? Keep previous versions of files so users can restore accidentally deleted or overwritten content.

Specific Functional Requirements

  1. File Upload and Download: Upload files up to 50 GB with resumable uploads for large files
  2. Real-Time Sync: Changes on one device appear on all linked devices within seconds
  3. File Chunking: Split files into 4 MB chunks for efficient delta sync — only upload changed chunks
  4. Deduplication: Content-addressable storage using SHA-256 hashes to eliminate duplicate file storage
  5. Conflict Resolution: Detect concurrent edits and create "conflicted copies" with clear naming
  6. Version History: Keep 30-180 days of file versions depending on plan, with ability to restore any version
  7. Sharing: Share files and folders via links or direct sharing with permission levels (view, edit)

Specific API Endpoints

```text
POST /api/v2/files/upload_session/start
  Response: { "session_id": "sess_abc123" }

PUT /api/v2/files/upload_session/append
  Headers: { "Dropbox-API-Arg": { "session_id": "sess_abc123", "offset": 0 } }
  Body: [4 MB chunk binary data]
  Response: HTTP 200

POST /api/v2/files/upload_session/finish
  Body: { "session_id": "sess_abc123", "commit": { "path": "/documents/report.pdf", "mode": "update", "content_hash": "sha256:abc..." } }
  Response: { "id": "id:abc123", "path": "/documents/report.pdf", "size": 15728640, "content_hash": "abc..." }

POST /api/v2/files/list_folder/longpoll
  Body: { "cursor": "cursor_abc" }
  Response: { "changes": true }  (long-poll returns when changes are available)

POST /api/v2/files/list_folder/continue
  Body: { "cursor": "cursor_abc" }
  Response: { "entries": [{ "tag": "file", "name": "report.pdf", "path": "...", "content_hash": "..." }], "cursor": "new_cursor", "has_more": false }
```

Specific Data Model

File Metadata (PostgreSQL, sharded by user_id)

| Column | Type | Notes |
| --- | --- | --- |
| file_id | UUID | Primary key |
| user_id | BIGINT | Shard key |
| path | VARCHAR | Full file path within user's Dropbox |
| content_hash | VARCHAR(64) | SHA-256 of file content, used for dedup |
| size_bytes | BIGINT | |
| version | INT | Incremented on each edit |
| is_deleted | BOOLEAN | Soft delete for version history |
| modified_at | TIMESTAMP | Client-side modification time |
| server_modified_at | TIMESTAMP | When the server received the change |

Chunk Store (Object Storage — S3/GCS)

  • Key: SHA-256 hash of chunk content (content-addressable)
  • Value: 4 MB chunk data (encrypted at rest)
  • Deduplication is automatic: identical chunks across all users share one physical copy
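
A content-addressed store of this shape can be sketched in a few lines. The `ChunkStore` class below is a hypothetical in-memory stand-in for the object store; encryption at rest and replication are deliberately left out, and the byte counters exist only to show the deduplication effect.

```python
import hashlib


class ChunkStore:
    """Hypothetical in-memory content-addressed store (no encryption, no replication)."""

    def __init__(self) -> None:
        self._blobs: dict[str, bytes] = {}   # SHA-256 hex digest -> chunk bytes
        self.logical_bytes = 0               # bytes written by all callers

    def put(self, data: bytes) -> str:
        key = hashlib.sha256(data).hexdigest()
        self.logical_bytes += len(data)
        # Identical content maps to the same key, so it is stored only once.
        self._blobs.setdefault(key, data)
        return key

    def get(self, key: str) -> bytes:
        return self._blobs[key]

    @property
    def physical_bytes(self) -> int:
        return sum(len(blob) for blob in self._blobs.values())


store = ChunkStore()
shared_pdf_chunk = b"same popular PDF chunk"
key_alice = store.put(shared_pdf_chunk)      # uploaded by one user
key_bob = store.put(shared_pdf_chunk)        # uploaded again by another user
store.put(b"a chunk only one user has")

print(key_alice == key_bob)                  # True: both users reference one object
print(store.logical_bytes, store.physical_bytes)
```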

File-to-Chunks Mapping (PostgreSQL)

| Column | Type | Notes |
| --- | --- | --- |
| file_id | UUID | |
| version | INT | |
| chunk_index | INT | Position in file |
| chunk_hash | VARCHAR(64) | Reference to chunk in object storage |
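
As a rough illustration of how this mapping is used, the sketch below rebuilds a file version from its rows and works out which chunks a device must download to move between versions. `ChunkRow` mirrors the columns above; the helper functions and the plain-dict blob store are hypothetical, and a real service would run these as indexed queries rather than list scans.

```python
import hashlib
from typing import NamedTuple


class ChunkRow(NamedTuple):
    file_id: str
    version: int
    chunk_index: int
    chunk_hash: str


def chunk_list(rows: list[ChunkRow], file_id: str, version: int) -> list[str]:
    """Ordered chunk hashes for one file version."""
    selected = [r for r in rows if r.file_id == file_id and r.version == version]
    return [r.chunk_hash for r in sorted(selected, key=lambda r: r.chunk_index)]


def chunks_to_download(rows: list[ChunkRow], file_id: str,
                       old_version: int, new_version: int) -> list[str]:
    """Hashes in the new version that the old version did not already contain."""
    have = set(chunk_list(rows, file_id, old_version))
    wanted = dict.fromkeys(chunk_list(rows, file_id, new_version))
    return [h for h in wanted if h not in have]


def assemble(rows: list[ChunkRow], file_id: str, version: int,
             blobs: dict[str, bytes]) -> bytes:
    """Concatenate chunks in index order to rebuild the file content."""
    return b"".join(blobs[h] for h in chunk_list(rows, file_id, version))


# Tiny demo: version 2 replaces only the middle chunk of version 1.
parts = {name: data for name, data in [("a", b"intro "), ("b", b"old body "),
                                       ("c", b"new body "), ("d", b"end")]}
h = {name: hashlib.sha256(data).hexdigest() for name, data in parts.items()}
blobs = {h[name]: data for name, data in parts.items()}
rows = [
    ChunkRow("f1", 1, 0, h["a"]), ChunkRow("f1", 1, 1, h["b"]), ChunkRow("f1", 1, 2, h["d"]),
    ChunkRow("f1", 2, 0, h["a"]), ChunkRow("f1", 2, 1, h["c"]), ChunkRow("f1", 2, 2, h["d"]),
]
print(assemble(rows, "f1", 2, blobs))
print(len(chunks_to_download(rows, "f1", 1, 2)), "chunk to download")
```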

Sync Journal (Cassandra): Ordered log of all changes per namespace (user or shared folder). Clients maintain a cursor and poll for changes since their last sync point.
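
The cursor mechanism can be sketched as an append-only log per namespace. This is a simplified in-memory model, not a Cassandra schema: `SyncJournal` is a hypothetical name, and representing the cursor as a plain integer (the count of entries the client has already seen) is an assumption made for readability.

```python
class SyncJournal:
    """Hypothetical in-memory change log, one ordered list per namespace."""

    def __init__(self) -> None:
        self._logs: dict[str, list[dict]] = {}

    def append(self, namespace: str, path: str, action: str, version: int) -> int:
        """Record a change and return its sequence number within the namespace."""
        log = self._logs.setdefault(namespace, [])
        seq = len(log) + 1
        log.append({"seq": seq, "path": path, "action": action, "version": version})
        return seq

    def changes_since(self, namespace: str, cursor: int, limit: int = 100) -> dict:
        """Return up to `limit` changes after `cursor`, plus the next cursor."""
        log = self._logs.get(namespace, [])
        page = log[cursor:cursor + limit]
        new_cursor = cursor + len(page)
        return {"changes": page, "new_cursor": new_cursor,
                "has_more": new_cursor < len(log)}


journal = SyncJournal()
journal.append("user:42", "/docs/report.pdf", "added", 1)
journal.append("user:42", "/docs/report.pdf", "modified", 2)
journal.append("user:42", "/notes.txt", "added", 1)

# A device that has seen nothing yet pages through the log two entries at a time.
cursor = 0
while True:
    page = journal.changes_since("user:42", cursor, limit=2)
    for change in page["changes"]:
        print(change["seq"], change["action"], change["path"])
    cursor = page["new_cursor"]
    if not page["has_more"]:
        break
```

Because the cursor only ever moves forward, a device that was offline for a while resumes from its last cursor and replays exactly the changes it missed.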

Specific Back-of-the-Envelope Numbers

Traffic:

  • 700M+ registered users, ~15M paying users
  • Average user syncs ~2 GB/month of changed data
  • Assume 50M daily active users, each syncing ~100 MB/day on average
  • File operations: 50M users * 20 file operations/day = 1 billion file ops/day = ~12,000 ops/second

Storage:

  • Total stored data: estimated at 1+ exabyte across all users
  • Deduplication saves ~30-60% of raw storage (many users store common files: OS installers, popular downloads)
  • Chunk store: average 4 MB chunks, 250 billion+ chunks stored

Sync performance:

  • Small file change (under 4 MB): single chunk upload, under 5 seconds sync time
  • Large file change (100 MB file, 1 MB changed): only 1 chunk re-uploaded out of 25, 96% bandwidth saved
  • Delta sync for a 1 GB file with 1% change: upload roughly 12 MB instead of 1 GB (a contiguous ~10 MB edit touches at least three 4 MB chunks)

Bandwidth:

  • 50M users * 100 MB/day = 5 PB/day of upload bandwidth
  • Download traffic is typically 2-3x upload (multiple devices syncing) = 10-15 PB/day
  • Peak: 3-5x average during business hours in each timezone
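
The estimates in this section can be re-derived with a few lines of arithmetic. The script below only restates the inputs given above and assumes decimal units (1 MB = 10^6 bytes, 1 PB = 10^15 bytes, 1 EB = 10^18 bytes).

```python
SECONDS_PER_DAY = 24 * 60 * 60
MB, PB, EB = 10**6, 10**15, 10**18   # decimal units (assumption)

daily_active_users = 50_000_000
ops_per_user_per_day = 20
upload_per_user_per_day = 100 * MB
chunk_size = 4 * MB

# Traffic: file operations per day and per second.
ops_per_day = daily_active_users * ops_per_user_per_day
ops_per_second = ops_per_day / SECONDS_PER_DAY

# Bandwidth: uploads per day, with downloads at 2-3x uploads.
upload_pb_per_day = daily_active_users * upload_per_user_per_day / PB
download_pb_per_day = (2 * upload_pb_per_day, 3 * upload_pb_per_day)

# Storage: number of 4 MB chunks in 1 EB.
total_chunks = 1 * EB // chunk_size

# Sync: a 100 MB file where the edit falls inside a single chunk.
chunks_in_file = 100 * MB // chunk_size
bandwidth_saved_pct = 100 * (chunks_in_file - 1) / chunks_in_file

print(f"{ops_per_day:,} ops/day = {ops_per_second:,.0f} ops/second")
print(f"{upload_pb_per_day:g} PB/day upload, "
      f"{download_pb_per_day[0]:g}-{download_pb_per_day[1]:g} PB/day download")
print(f"{total_chunks:,} chunks per exabyte")
print(f"{chunks_in_file} chunks per 100 MB file, {bandwidth_saved_pct:g}% bandwidth saved")
```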
