Uber's Schemaless: A Trip-Optimized Datastore

How Uber built Schemaless, a fault-tolerant, append-only datastore on top of MySQL that powers trip storage and handles millions of writes per second.

Company Context

Figure: System architecture for Uber's Schemaless, showing how services, databases, and caches connect.

Uber processes millions of rides per day across hundreds of cities. Each trip generates a cascade of data: ride requests, driver matches, GPS traces, fare calculations, payment transactions, and receipts. This data must be written with high throughput, read with low latency, and retained indefinitely for financial reconciliation and regulatory compliance.

In 2014, Uber's primary datastore was PostgreSQL. As trip volume grew exponentially, that setup hit its ceiling: write throughput saturated, replication lag increased, and schema migrations on tables with billions of rows became multi-day operations that required maintenance windows.

The Problem at Scale

Figure: How Uber's Schemaless processes a request from start to finish, step by step.

Uber needed a datastore that could:

Handle millions of writes per second across all cities
Support schema changes without downtime or table locking
Provide fault tolerance with no single point of failure
Scale linearly by adding machines
Serve as the source of truth for trip data used by dozens of downstream services

Traditional relational databases could not satisfy all these requirements simultaneously. NoSQL databases like Cassandra were considered but rejected because Uber wanted to retain MySQL's operational familiarity and the ability to run SQL queries for analytics.

Architecture Solution

Figure: Comparison of key metrics, approaches, and tradeoffs for Uber's Schemaless.

Schemaless is an append-only, sharded datastore built on top of MySQL. Rather than storing data in typed columns, each row contains a row key, a column name, a reference key, and a blob (JSON or MessagePack serialized). The blob holds the actual data, and Schemaless imposes no schema on it — hence the name.

The smallest unit in the data model is a cell, addressed by three coordinates: (row_key, column_name, ref_key). The row key identifies the entity (for example, a trip), the column name identifies which piece of that entity the cell holds, and the ref_key is a version identifier. Every cell is immutable: an update appends a new cell with a new ref_key rather than overwriting the existing one. This append-only design simplifies replication (no conflict resolution is needed for concurrent writes) and provides a built-in audit trail.
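As a rough illustration of the cell model, the sketch below uses SQLite as a stand-in for a single MySQL shard. The table name, column types, and the helper functions `put_cell` and `get_latest` are assumptions made for this example, not Uber's actual schema or API.

```python
import json
import sqlite3

# SQLite stands in for one MySQL shard; the table layout is illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE cell (
        row_key     TEXT    NOT NULL,
        column_name TEXT    NOT NULL,
        ref_key     INTEGER NOT NULL,
        body        BLOB    NOT NULL,
        PRIMARY KEY (row_key, column_name, ref_key)
    )
    """
)


def put_cell(row_key, column_name, ref_key, data):
    """Append a new cell. An existing (row_key, column_name, ref_key) is never overwritten."""
    blob = json.dumps(data).encode("utf-8")
    # A plain INSERT raises sqlite3.IntegrityError if the cell already exists,
    # which is what keeps cells immutable in this sketch.
    with conn:
        conn.execute(
            "INSERT INTO cell (row_key, column_name, ref_key, body) VALUES (?, ?, ?, ?)",
            (row_key, column_name, ref_key, blob),
        )


def get_latest(row_key, column_name):
    """Return (ref_key, data) for the highest ref_key, or None if no cell exists."""
    row = conn.execute(
        "SELECT ref_key, body FROM cell "
        "WHERE row_key = ? AND column_name = ? "
        "ORDER BY ref_key DESC LIMIT 1",
        (row_key, column_name),
    ).fetchone()
    if row is None:
        return None
    return row[0], json.loads(row[1].decode("utf-8"))
```

Calling `put_cell("trip-1", "status", 1, {...})` and then `put_cell("trip-1", "status", 2, {...})` leaves both versions in the table, and `get_latest("trip-1", "status")` returns the second. Older versions remain queryable, which is where the audit trail comes from.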

Data is sharded across MySQL instances using consistent hashing on the row key. Each shard is a standard MySQL database with InnoDB tables. The MySQL layer handles storage, indexing, and local queries. Schemaless handles routing, replication, and cross-shard operations.
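The routing step can be sketched with a small consistent-hash ring. This is a generic illustration of the technique named above, not Uber's routing code; the `HashRing` class, the node names, and the choice of MD5 and 64 virtual nodes per instance are all assumptions.

```python
import bisect
import hashlib


def _hash(value: str) -> int:
    """Stable 64-bit hash. Python's built-in hash() is randomized per process."""
    return int.from_bytes(hashlib.md5(value.encode("utf-8")).digest()[:8], "big")


class HashRing:
    """Maps row keys to nodes; each node owns several points on the ring."""

    def __init__(self, nodes, vnodes=64):
        self._ring = []  # sorted list of (point, node)
        for node in nodes:
            self.add_node(node, vnodes)

    def add_node(self, node, vnodes=64):
        # Virtual nodes spread each instance's ownership around the ring.
        for i in range(vnodes):
            bisect.insort(self._ring, (_hash(f"{node}#{i}"), node))

    def node_for(self, row_key):
        if not self._ring:
            raise ValueError("ring has no nodes")
        # Find the first point at or after the key's hash, wrapping to the start.
        idx = bisect.bisect_left(self._ring, (_hash(row_key), ""))
        if idx == len(self._ring):
            idx = 0
        return self._ring[idx][1]


ring = HashRing(["mysql-a", "mysql-b", "mysql-c"])
target = ring.node_for("trip-8f3a")  # the same key always routes to the same node
```

The property that matters for scaling is that adding a node only moves the keys that land on the new node's points; every other key keeps its existing owner.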

Figure: Schemaless layers routing and append-only semantics on top of sharded MySQL.

Key Technical Decisions

Figure: Data flow through Uber's Schemaless, showing how requests and responses move through the system.

Append-only writes eliminate the need for distributed locking. Two services writing to the same row key at the same time simply create two cells with different ref_keys. The application reads the latest ref_key to get the current state. This design enabled Uber to scale writes horizontally without coordination between shards.

Change Data Capture (CDC) is built into Schemaless. Every cell write generates a changelog event that downstream services can consume. This replaces the need for polling or explicit event publishing. Uber's pricing service, receipt generator, and analytics pipeline all consume the Schemaless changelog to react to trip state changes.
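A minimal in-memory sketch of that idea follows: each cell write appends one event to an ordered changelog, and each consumer tracks how far it has read. The `Store`, `ChangeEvent`, and `Consumer` names and the offset scheme are hypothetical, chosen only to illustrate the pattern.

```python
from dataclasses import dataclass, field


@dataclass
class ChangeEvent:
    seq: int          # position in the changelog, starting at 1
    row_key: str
    column_name: str
    ref_key: int


@dataclass
class Store:
    cells: dict = field(default_factory=dict)
    changelog: list = field(default_factory=list)

    def put_cell(self, row_key, column_name, ref_key, body):
        key = (row_key, column_name, ref_key)
        if key in self.cells:
            raise KeyError(f"cell {key} already exists")
        self.cells[key] = body
        # The write and its changelog entry are recorded together.
        seq = len(self.changelog) + 1
        self.changelog.append(ChangeEvent(seq, row_key, column_name, ref_key))

    def events_after(self, offset, limit=100):
        """Return up to `limit` events with seq > offset, oldest first."""
        return self.changelog[offset:offset + limit]


class Consumer:
    """Downstream service that reacts to new cells, remembering its own offset."""

    def __init__(self, store, handler):
        self.store = store
        self.handler = handler
        self.offset = 0  # seq of the last event handled

    def consume_new(self):
        events = self.store.events_after(self.offset)
        for event in events:
            body = self.store.cells[(event.row_key, event.column_name, event.ref_key)]
            self.handler(event, body)
            self.offset = event.seq
        return len(events)
```

Because each consumer keeps its own offset, a receipt generator and an analytics pipeline could read the same changelog at different speeds without affecting each other or the writer.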

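Buffered writes protect against data loss during failover. MySQL replication is asynchronous, so a write acknowledged by a shard's master can be lost if that master fails before its replicas catch up. Schemaless therefore also writes each cell to a buffer table on a second, randomly chosen cluster. A background process removes the buffered copy once the write has appeared on the primary cluster's replicas.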

Schema evolution happens at the application level. Since the datastore stores opaque blobs, the application serializes and deserializes the data. Changing the schema means updating the application code — no ALTER TABLE, no table locks, no migration scripts. Backward compatibility is handled through standard serialization practices (optional fields, default values).
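The sketch below shows what application-level schema evolution might look like with JSON blobs. The `TripFare` type and its fields are invented for illustration. The decoder supplies defaults for fields that older blobs lack and ignores fields it does not recognize.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class TripFare:
    amount_cents: int                 # present since the first version
    currency: str = "USD"             # added later; old blobs fall back to the default
    tip_cents: Optional[int] = None   # optional field, absent in most blobs


def decode_fare(blob: bytes) -> TripFare:
    """Deserialize a blob written by any version of the application."""
    raw = json.loads(blob)
    return TripFare(
        amount_cents=raw["amount_cents"],
        currency=raw.get("currency", "USD"),
        tip_cents=raw.get("tip_cents"),
        # Unknown keys written by newer code are ignored here.
    )


def encode_fare(fare: TripFare) -> bytes:
    """Serialize using the current field set."""
    return json.dumps(asdict(fare)).encode("utf-8")


old_blob = b'{"amount_cents": 1250}'
fare = decode_fare(old_blob)  # currency defaults to "USD", tip_cents to None
```

Adding a field this way requires only a code deploy. The cost is that every reader has to tolerate every blob shape ever written, since old cells are never rewritten.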

Strengths

Figure: Key components of Uber's Schemaless, showing each building block and its responsibility.

Horizontal write scaling by adding MySQL shards
Zero-downtime schema evolution
Built-in audit trail through append-only cells
MySQL operational familiarity (backups, monitoring, tooling)
Integrated CDC for real-time event propagation

Weaknesses

Figure: Interview tips for Uber's Schemaless, with key points to mention and mistakes to avoid.

No native support for complex queries (joins, aggregations) — analytics requires a separate system
Application must handle schema versioning and backward compatibility
Append-only storage grows faster than mutable storage, requiring compaction or archival
Each read requires deserializing a blob, which is slower than reading typed columns

Modern Evolution

Figure: When to use Uber's Schemaless and when alternative approaches are better.

Schemaless evolved into Uber's Docstore, which added indexing, query capabilities, and a more structured API while retaining the sharded MySQL foundation. Docstore now serves as the primary datastore for dozens of Uber services beyond trip storage, including Uber Eats orders and Uber Freight shipments.

Interview Relevance

Schemaless illustrates several important system design patterns: using a proven technology (MySQL) as a building block for a custom distributed system, choosing append-only semantics to simplify distributed writes, and building CDC into the storage layer rather than bolting it on afterward. When discussing database design in interviews, Schemaless is a strong example of pragmatic engineering — using boring technology in a clever way rather than adopting the latest distributed database.

Figure: Advantages and disadvantages of Uber's Schemaless, with real-world considerations.

Figure: Study guide and learning recommendations for Uber's Schemaless.

Practical Implementation for .NET Developers

In a .NET application, you would typically implement this pattern using the following approach:

ASP.NET Core setup: Create a service class that encapsulates the logic, register it with dependency injection, and inject it into your controllers or minimal API endpoints. The built-in DI container handles lifecycle management.

Entity Framework Core: For database interactions, EF Core provides the ORM layer. Use migrations for schema management and raw SQL for performance-critical queries. Consider Dapper for read-heavy paths where EF Core's overhead matters.

Azure integration: If deploying to Azure, leverage managed services — Azure Cache for Redis, Azure SQL, Azure Service Bus, Azure Cosmos DB. These eliminate operational overhead and provide built-in monitoring through Application Insights.

Testing: Use xUnit with Testcontainers for integration tests that spin up real databases in Docker. Mock external dependencies with NSubstitute. The WebApplicationFactory class lets you test your entire HTTP pipeline in-process.

Monitoring: Add Application Insights telemetry to track request latency, dependency calls, and custom metrics. Use structured logging with Serilog to make production debugging possible:

Log.Information("Processing order {OrderId} for {CustomerId}", orderId, customerId);

This gives you searchable, structured logs in Azure Monitor or Seq.

Key Takeaways for Interviews

Understand the core problem this resource addresses and be able to explain it in 2-3 sentences without jargon
Know the key trade-offs: what does this approach optimize for, and what does it sacrifice?
Be ready to compare this with alternative approaches and explain when each is appropriate
Connect the concepts to real-world systems you have worked with or studied
Demonstrate depth by discussing failure modes and how they are handled

How This Applies to Modern .NET Systems

The concepts from this resource translate to .NET through several established libraries and patterns:

Azure managed services often abstract away the underlying distributed systems complexity, but understanding the fundamentals helps you configure them correctly, debug issues, and make informed architectural decisions.

NuGet packages in the .NET ecosystem provide production-ready implementations of many patterns described in this resource. Before building custom solutions, check if a well-maintained package already exists.

The ASP.NET Core middleware pipeline is where many of these patterns are implemented in practice: caching, rate limiting, health checks, and circuit breaking all fit naturally into the middleware model.

Sources

Uber Engineering Blog - Schemaless (article)