How to Choose a Database
How to Choose a Database
"Which database should I use?" is one of the most common system design questions, and the answer is always "it depends." This guide provides a decision framework based on your actual requirements rather than hype or familiarity.
Start With Your Data Model
The shape of your data is the strongest signal for database selection.
Structured data with relationships: If your data has well-defined schemas with relationships between entities (users have orders, orders have items, items belong to categories), start with a relational database (PostgreSQL, MySQL). SQL databases enforce schema consistency, support complex joins, and provide ACID transactions. PostgreSQL is the default choice for most applications — it handles JSON, full-text search, and geospatial data well enough that you often do not need a second database.
Document-oriented data: If your data is hierarchical, schema-flexible, or naturally fits into self-contained documents (user profiles with nested preferences, product catalogs with variable attributes), consider MongoDB or DynamoDB. Document databases excel when you read and write entire documents and rarely need cross-document joins.
Key-value lookups: If your access pattern is primarily "get value by key" with no complex queries — session storage, feature flags, user preferences — a key-value store like Redis or DynamoDB is ideal. Redis adds the benefit of sub-millisecond latency since it operates in memory.
Wide-column / time-series data: If you write enormous volumes of data with a time component (metrics, logs, IoT sensor data, event streams), consider Cassandra, ScyllaDB, or a purpose-built time-series database like InfluxDB or TimescaleDB. These databases optimize for high write throughput and time-range queries.
Graph data: If your core queries traverse relationships (social networks, recommendation engines, fraud detection), a graph database like Neo4j provides efficient traversals that would require expensive recursive joins in SQL.
Full-text search: If your primary use case is searching text (product search, log analysis), Elasticsearch or OpenSearch provide inverted indexes optimized for text queries. These are typically used alongside a primary database, not as a replacement.
Decision Tree
- Do you need ACID transactions across multiple entities? If yes, use a relational database (PostgreSQL, MySQL). DynamoDB supports transactions within a single table; Spanner supports them globally.
-
Is your read pattern primarily key-based lookups? If yes and you need speed, use Redis. If you need durability and scale, use DynamoDB.
-
Is your write volume extremely high (>100K writes/sec)? If yes, consider Cassandra, ScyllaDB, or DynamoDB. Relational databases struggle at extreme write volumes on a single node.
-
Do you need flexible schemas? If your schema changes frequently or varies between records, document databases (MongoDB) or wide-column stores (Cassandra) provide more flexibility than relational databases.
-
Is your data time-series? Use TimescaleDB (if you want SQL) or InfluxDB (purpose-built). Cassandra also works well for time-series with proper partition key design.
- Do you need full-text search? Add Elasticsearch alongside your primary database. Do not use it as your source of truth.
Common Mistakes
Choosing based on scale you do not have. PostgreSQL handles millions of rows and thousands of queries per second on a single node. Most startups will never outgrow it. Do not pick Cassandra because you might need it someday — pick it when you actually need horizontal write scaling.
Using one database for everything. A relational database is a great default, but using it for full-text search, caching, and message queuing leads to a slow, overloaded system. Use the right tool for each job.
Ignoring operational complexity. Running a Cassandra cluster requires significant operational expertise. A managed service (DynamoDB, Cloud Spanner, managed PostgreSQL) trades cost for operational simplicity.
Picking a database because of one feature. MongoDB has flexible schemas, but if you also need transactions across collections, that flexibility comes at a cost. Evaluate the full set of requirements, not one standout feature.
The Default Recommendation
If you are unsure, start with PostgreSQL. It is the most versatile database available: relational with ACID, JSON support for semi-structured data, full-text search for basic search needs, LISTEN/NOTIFY for simple pub/sub, and a mature ecosystem of tools and extensions. Add specialized databases only when PostgreSQL demonstrably cannot handle a specific workload.
Practical Implementation for .NET Developers
In a .NET application, you would typically implement this pattern using the following approach:
ASP.NET Core setup: Create a service class that encapsulates the logic, register it with dependency injection, and inject it into your controllers or minimal API endpoints. The built-in DI container handles lifecycle management.
Entity Framework Core: For database interactions, EF Core provides the ORM layer. Use migrations for schema management and raw SQL for performance-critical queries. Consider Dapper for read-heavy paths where EF Core's overhead matters.
Azure integration: If deploying to Azure, leverage managed services — Azure Cache for Redis, Azure SQL, Azure Service Bus, Azure Cosmos DB. These eliminate operational overhead and provide built-in monitoring through Application Insights.
Testing: Use xUnit with Testcontainers for integration tests that spin up real databases in Docker. Mock external dependencies with NSubstitute. The WebApplicationFactory class lets you test your entire HTTP pipeline in-process.
Monitoring: Add Application Insights telemetry to track request latency, dependency calls, and custom metrics. Use structured logging with Serilog to make production debugging possible:
Log.Information("Processing order {OrderId} for {CustomerId}", orderId, customerId);
This gives you searchable, structured logs in Azure Monitor or Seq.
What Most Articles Get Wrong
Many articles about How To Choose A Database present an oversimplified view that misses the operational reality. In production, the theoretical best practices often collide with constraints like legacy systems, team expertise, budget limitations, and compliance requirements. The engineers who successfully implement these patterns at scale are the ones who understand not just the "what" but the "when" and "when not to."
The nuance that matters: context determines everything. A pattern that works at Netflix's scale (200M users, 1000+ engineers) is overkill for a startup with 10,000 users and 3 engineers. Always match the solution complexity to the problem complexity.
The Numbers That Matter
- Latency percentiles matter more than averages: p99 latency often reveals problems that p50 hides
- Error budgets quantify acceptable risk: if your SLA is 99.95%, you have 21.9 minutes of downtime per month to spend on deployments and experiments
- Cost per request at scale determines architecture: a $0.001 cost difference per request becomes $1M per year at 1 billion requests/year
- Team cognitive load is the hidden constraint: a system your team cannot understand is a system your team cannot operate safely