What are the key takeaways from How to Choose a Database?

How to Choose a Database provides practical insights for building production-grade distributed systems. The main lessons apply to system design interviews and real-world architecture decisions.

How to Choose a Database

2025-01-157 min read

How to Choose a Database

"Which database should I use?" is one of the most common system design questions, and the answer is always "it depends." This guide provides a decision framework based on your actual requirements rather than hype or familiarity.

Start With Your Data Model

The shape of your data is the strongest signal for database selection.

How to Choose a Database system architecture with service components and data flow — System architecture for How to Choose a Database

Structured data with relationships: If your data has well-defined schemas with relationships between entities (users have orders, orders have items, items belong to categories), start with a relational database (PostgreSQL, MySQL). SQL databases enforce schema consistency, support complex joins, and provide ACID transactions. PostgreSQL is the default choice for most applications — it handles JSON, full-text search, and geospatial data well enough that you often do not need a second database.

Document-oriented data: If your data is hierarchical, schema-flexible, or naturally fits into self-contained documents (user profiles with nested preferences, product catalogs with variable attributes), consider MongoDB or DynamoDB. Document databases excel when you read and write entire documents and rarely need cross-document joins.

Key-value lookups: If your access pattern is primarily "get value by key" with no complex queries — session storage, feature flags, user preferences — a key-value store like Redis or DynamoDB is ideal. Redis adds the benefit of sub-millisecond latency since it operates in memory.

Wide-column / time-series data: If you write enormous volumes of data with a time component (metrics, logs, IoT sensor data, event streams), consider Cassandra, ScyllaDB, or a purpose-built time-series database like InfluxDB or TimescaleDB. These databases optimize for high write throughput and time-range queries.

Step-by-step diagram showing how How to Choose a Database works in practice — How How to Choose a Database works step by step

Graph data: If your core queries traverse relationships (social networks, recommendation engines, fraud detection), a graph database like Neo4j provides efficient traversals that would require expensive recursive joins in SQL.

Full-text search: If your primary use case is searching text (product search, log analysis), Elasticsearch or OpenSearch provide inverted indexes optimized for text queries. These are typically used alongside a primary database, not as a replacement.

Decision Tree

Do you need ACID transactions across multiple entities? If yes, use a relational database (PostgreSQL, MySQL). DynamoDB supports transactions within a single table; Spanner supports them globally.

Comparison table for How to Choose a Database showing key metrics and tradeoffs — Comparing key aspects of How to Choose a Database

Is your read pattern primarily key-based lookups? If yes and you need speed, use Redis. If you need durability and scale, use DynamoDB.
Is your write volume extremely high (>100K writes/sec)? If yes, consider Cassandra, ScyllaDB, or DynamoDB. Relational databases struggle at extreme write volumes on a single node.
Do you need flexible schemas? If your schema changes frequently or varies between records, document databases (MongoDB) or wide-column stores (Cassandra) provide more flexibility than relational databases.
Is your data time-series? Use TimescaleDB (if you want SQL) or InfluxDB (purpose-built). Cassandra also works well for time-series with proper partition key design.

Data flow diagram for How to Choose a Database showing request and response paths — Data flow through How to Choose a Database

Do you need full-text search? Add Elasticsearch alongside your primary database. Do not use it as your source of truth.

Common Mistakes

Choosing based on scale you do not have. PostgreSQL handles millions of rows and thousands of queries per second on a single node. Most startups will never outgrow it. Do not pick Cassandra because you might need it someday — pick it when you actually need horizontal write scaling.

Using one database for everything. A relational database is a great default, but using it for full-text search, caching, and message queuing leads to a slow, overloaded system. Use the right tool for each job.

Key components of How to Choose a Database with roles and responsibilities — Key components of How to Choose a Database

Ignoring operational complexity. Running a Cassandra cluster requires significant operational expertise. A managed service (DynamoDB, Cloud Spanner, managed PostgreSQL) trades cost for operational simplicity.

Picking a database because of one feature. MongoDB has flexible schemas, but if you also need transactions across collections, that flexibility comes at a cost. Evaluate the full set of requirements, not one standout feature.

The Default Recommendation

If you are unsure, start with PostgreSQL. It is the most versatile database available: relational with ACID, JSON support for semi-structured data, full-text search for basic search needs, LISTEN/NOTIFY for simple pub/sub, and a mature ecosystem of tools and extensions. Add specialized databases only when PostgreSQL demonstrably cannot handle a specific workload.

Interview tips for How to Choose a Database system design questions — Interview tips for How to Choose a Database

Practical Implementation for .NET Developers

In a .NET application, you would typically implement this pattern using the following approach:

ASP.NET Core setup: Create a service class that encapsulates the logic, register it with dependency injection, and inject it into your controllers or minimal API endpoints. The built-in DI container handles lifecycle management.

Entity Framework Core: For database interactions, EF Core provides the ORM layer. Use migrations for schema management and raw SQL for performance-critical queries. Consider Dapper for read-heavy paths where EF Core's overhead matters.

Decision guide showing when to use How to Choose a Database and when to avoid — When to use How to Choose a Database

Azure integration: If deploying to Azure, leverage managed services — Azure Cache for Redis, Azure SQL, Azure Service Bus, Azure Cosmos DB. These eliminate operational overhead and provide built-in monitoring through Application Insights.

Testing: Use xUnit with Testcontainers for integration tests that spin up real databases in Docker. Mock external dependencies with NSubstitute. The WebApplicationFactory class lets you test your entire HTTP pipeline in-process.

Monitoring: Add Application Insights telemetry to track request latency, dependency calls, and custom metrics. Use structured logging with Serilog to make production debugging possible:

text

Log.Information("Processing order {OrderId} for {CustomerId}", orderId, customerId);

This gives you searchable, structured logs in Azure Monitor or Seq.

Pros and cons analysis of How to Choose a Database for system design decisions — Advantages and disadvantages of How to Choose a Database

What Most Articles Get Wrong

Many articles about How To Choose A Database present an oversimplified view that misses the operational reality. In production, the theoretical best practices often collide with constraints like legacy systems, team expertise, budget limitations, and compliance requirements. The engineers who successfully implement these patterns at scale are the ones who understand not just the "what" but the "when" and "when not to."

The nuance that matters: context determines everything. A pattern that works at Netflix's scale (200M users, 1000+ engineers) is overkill for a startup with 10,000 users and 3 engineers. Always match the solution complexity to the problem complexity.

Real-world companies using How to Choose a Database in production systems — Real-world examples of How to Choose a Database

The Numbers That Matter

Latency percentiles matter more than averages: p99 latency often reveals problems that p50 hides
Error budgets quantify acceptable risk: if your SLA is 99.95%, you have 21.9 minutes of downtime per month to spend on deployments and experiments
Cost per request at scale determines architecture: a $0.001 cost difference per request becomes $1M per year at 1 billion requests/year
Team cognitive load is the hidden constraint: a system your team cannot understand is a system your team cannot operate safely

Sources

Designing Data-Intensive Applications — Martin Kleppmann