Databases and Storage Systems: Lesson 3.3

Database sharding strategies for large-scale systems

horizontal sharding, shard key selection, range vs hash sharding, cross-shard queries, hotspot shards, resharding complexity

What Sharding Solves

When a single database server can no longer hold all your data or absorb your write load, you split the data horizontally across multiple database nodes, each holding a subset of the rows. This is sharding.

Shard Key Selection

The shard key determines which shard a record lives on. A bad shard key creates hotspots: a few shards take most of the traffic while the others sit nearly idle.

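// Bad shard key: creation date (created_at truncated to the day)
// Every insert made today hashes to the same shard (today's shard).
// Range sharding on created_at has the same problem: new rows always land on the newest range.
shard = hash(date(created_at)) % num_shards // HOTSPOT

// Good shard key: user_id
// Inserts are distributed across all shards
shard = hash(user_id) % num_shards // EVEN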
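
To see the difference concretely, here is a minimal Python sketch that routes the same batch of inserts by each key and counts rows per shard. The shard count, the `stable_hash` and `shard_for` helpers, and the generated user IDs are all assumptions made for illustration, and `created_at` is assumed to be stored at day granularity.

```python
import hashlib
from collections import Counter
from datetime import date

NUM_SHARDS = 4  # assumed shard count for the illustration


def stable_hash(value: str) -> int:
    """Deterministic hash. Python's built-in hash() is salted per process for strings."""
    return int.from_bytes(hashlib.md5(value.encode()).digest()[:8], "big")


def shard_for(key: str, num_shards: int = NUM_SHARDS) -> int:
    """Route a key to a shard index with hash-modulo placement."""
    return stable_hash(key) % num_shards


# 10,000 hypothetical inserts made on the same day by different users.
today = date(2024, 1, 15).isoformat()
inserts = [(f"user-{i}", today) for i in range(10_000)]

by_date = Counter(shard_for(created) for _, created in inserts)
by_user = Counter(shard_for(user) for user, _ in inserts)

print("keyed by date:   ", dict(by_date))  # a single shard receives every row
print("keyed by user_id:", dict(by_user))  # rows spread over all shards
```

The date-keyed counter always has exactly one entry, because every row carries the same key value. The user-keyed counts should come out close to equal, though the exact split depends on the hash function.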

Range vs Hash Sharding

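Range Sharding: each shard owns a contiguous key range, e.g. shard 0 holds user IDs 0 to 999,999 and shard 1 holds 1,000,000 to 1,999,999. Range scans are easy because adjacent keys live on the same shard, but a high-traffic ID range (often the newest IDs) turns its shard into a hotspot.
Hash Sharding: hash(key) % N spreads keys evenly across N shards, but adjacent keys land on different shards, so a range query has to be sent to every shard.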
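
The trade-off can be sketched in a few lines of Python. The range boundaries, the four-shard layout, and the helper names (`range_shard`, `hash_shard`, `shards_for_range_scan`) are assumptions chosen for this example, not part of any particular database.

```python
import bisect
import hashlib

# Range sharding: exclusive upper bound of each shard's user-ID range.
# Shard 0 holds [0, 1M), shard 1 holds [1M, 2M), and so on (assumed layout).
RANGE_BOUNDS = [1_000_000, 2_000_000, 3_000_000, 4_000_000]
NUM_SHARDS = len(RANGE_BOUNDS)


def range_shard(user_id: int) -> int:
    """Route by range: index of the first upper bound greater than user_id."""
    idx = bisect.bisect_right(RANGE_BOUNDS, user_id)
    if idx >= NUM_SHARDS:
        raise ValueError(f"user_id {user_id} is outside every shard range")
    return idx


def hash_shard(user_id: int) -> int:
    """Route by hash: a stable digest of the key, modulo the shard count."""
    digest = hashlib.sha256(str(user_id).encode()).digest()
    return int.from_bytes(digest[:8], "big") % NUM_SHARDS


def shards_for_range_scan(lo: int, hi: int) -> list[int]:
    """Shards that a range-sharded scan over [lo, hi] must visit."""
    return list(range(range_shard(lo), range_shard(hi) + 1))


# A scan over IDs 1,500,000..1,500,999 stays on one range shard...
print(shards_for_range_scan(1_500_000, 1_500_999))  # [1]

# ...while the same 1,000 IDs are scattered across the hash shards.
print(sorted({hash_shard(uid) for uid in range(1_500_000, 1_501_000)}))
```

With range sharding the router can prune the scan to the shards whose ranges overlap the query. With hash sharding there is nothing to prune, so the scan becomes a query against every shard.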

Cross-Shard Queries

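Queries that touch multiple shards are expensive. If the query does not filter on the shard key, you must send it to every shard and merge the results in the application layer (scatter-gather), so the response is only as fast as the slowest shard. This is why denormalization is common in sharded systems: you store data redundantly, next to the rows that need it, to avoid cross-shard joins.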
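
A scatter-gather "top N" query might look like the following Python sketch. The in-memory `SHARDS` list and the `query_shard` function are stand-ins for real database connections and are purely hypothetical. The point is the shape of the merge step that ends up in application code.

```python
import heapq
from itertools import islice

# Hypothetical in-memory "shards": each holds (score, user_id) rows for its users.
SHARDS = [
    [(91, "u3"), (40, "u7"), (77, "u11")],
    [(85, "u2"), (99, "u6")],
    [(12, "u1"), (88, "u5"), (63, "u9")],
]


def query_shard(rows, limit):
    """Stand-in for a per-shard 'ORDER BY score DESC LIMIT n' query."""
    return sorted(rows, reverse=True)[:limit]


def top_scores(limit):
    """Scatter the query to every shard, then merge the sorted partial results."""
    partials = [query_shard(rows, limit) for rows in SHARDS]
    # Each partial is already sorted descending, so a k-way merge is enough.
    merged = heapq.merge(*partials, reverse=True)
    return list(islice(merged, limit))


print(top_scores(3))  # [(99, 'u6'), (91, 'u3'), (88, 'u5')]
```

Each shard must return its own top `limit` rows, because any one shard could hold all of the global winners. The work therefore grows with the number of shards even though the final result stays small.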

Resharding

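When you add shards, you must move data between them, usually while the system keeps serving traffic. This is one of the hardest operational problems in distributed databases. With hash(key) % N placement, changing N remaps most keys to a different shard. Consistent hashing reduces the movement required: going from N to N+1 shards moves only about 1/(N+1) of the keys on average.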
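
The following Python sketch compares how many keys change shards when a fifth shard is added, first with modulo placement and then with a consistent-hash ring. The `HashRing` class, the shard names, and the choice of 100 virtual nodes per shard are assumptions for illustration, not a production design.

```python
import bisect
import hashlib


def h(value: str) -> int:
    """Stable 64-bit hash derived from MD5 (used for placement, not security)."""
    return int.from_bytes(hashlib.md5(value.encode()).digest()[:8], "big")


def modulo_shard(key: str, shards: list[str]) -> str:
    return shards[h(key) % len(shards)]


class HashRing:
    """Minimal consistent-hash ring with virtual nodes (vnodes is an assumed knob)."""

    def __init__(self, shards, vnodes=100):
        self._points = sorted(
            (h(f"{shard}#{i}"), shard) for shard in shards for i in range(vnodes)
        )
        self._hashes = [point for point, _ in self._points]

    def shard_for(self, key: str) -> str:
        # First ring point clockwise from the key's hash, wrapping at the end.
        idx = bisect.bisect_right(self._hashes, h(key)) % len(self._points)
        return self._points[idx][1]


keys = [f"user-{i}" for i in range(10_000)]
old = ["shard-0", "shard-1", "shard-2", "shard-3"]
new = old + ["shard-4"]

moved_modulo = sum(modulo_shard(k, old) != modulo_shard(k, new) for k in keys)
old_ring, new_ring = HashRing(old), HashRing(new)
moved_ring = sum(old_ring.shard_for(k) != new_ring.shard_for(k) for k in keys)

print(f"modulo:          {moved_modulo / len(keys):.0%} of keys moved")
print(f"consistent hash: {moved_ring / len(keys):.0%} of keys moved")
```

Going from 4 to 5 shards with modulo placement, a key stays put only when its hash gives the same remainder for both divisors, so roughly 80% of keys should move. On the ring, the only keys that move are the ones the new shard takes over, which should be roughly a fifth of them. The exact figure varies with the number of virtual nodes.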