Database sharding strategies for large-scale systems
horizontal sharding, shard key selection, range vs hash sharding, cross-shard queries, hotspot shards, resharding complexity
What Sharding Solves
When a single database server can't hold all your data or handle all writes, you split data across multiple database nodes — this is sharding.
Shard Key Selection
The shard key determines which shard a record lives on. A bad shard key creates hotspots.
// Bad shard key: created_at date
// All new inserts go to the same shard (today's shard)
shard = hash(created_at) % num_shards // HOTSPOT
// Good shard key: user_id
// Inserts are distributed across all shards
shard = hash(user_id) % num_shards // EVENRange vs Hash Sharding
- Range Sharding: shard 0 holds user IDs 0–1M, shard 1 holds 1M–2M. Easy range scans but hotspots at high-traffic ID ranges.
- Hash Sharding: hash(key) % N distributes evenly but kills range queries.
Cross-Shard Queries
Queries that touch multiple shards are expensive. You must query all shards and merge results in the application layer. This is why denormalization is common in sharded systems — you store data redundantly to avoid cross-shard joins.
Resharding
When you add shards, you must move data between them. This is one of the hardest operational problems in distributed databases. Consistent hashing reduces the movement required.
