Script Valley
Interview Prep: System Design Rounds
Databases and Storage Systems, Lesson 3.3

Database sharding strategies for large-scale systems

horizontal sharding, shard key selection, range vs hash sharding, cross-shard queries, hotspot shards, resharding complexity

What Sharding Solves

When a single database server can no longer hold all your data or absorb all your writes, you split the data horizontally across multiple database nodes, each holding a subset of the rows. This is sharding.

Shard Key Selection

The shard key determines which shard a record lives on. A bad shard key creates hotspots.

// Bad shard key: created_at truncated to the day
// Every insert made today maps to the same shard, so one shard takes all the writes
shard = hash(date(created_at)) % num_shards // HOTSPOT

// Better shard key: user_id (high cardinality, writes spread across many users)
// Inserts are distributed across all shards
shard = hash(user_id) % num_shards // EVEN
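To make the hotspot concrete, here is a minimal Python sketch that routes the same burst of writes by both shard keys and counts writes per shard. The shard count, the `stable_hash` helper, and the synthetic `user-N` IDs are assumptions for illustration only. MD5 stands in for the hash function because Python's built-in `hash()` is salted per process for strings.

```python
import hashlib
from collections import Counter
from datetime import datetime, timedelta

NUM_SHARDS = 4  # assumed shard count for this demo


def stable_hash(value: str) -> int:
    """Deterministic 64-bit hash derived from MD5 (illustrative choice)."""
    return int.from_bytes(hashlib.md5(value.encode()).digest()[:8], "big")


def shard_for(key: str, num_shards: int = NUM_SHARDS) -> int:
    """Map a shard key to a shard index with hash-modulo routing."""
    return stable_hash(key) % num_shards


def simulate_writes(num_writes: int = 10_000):
    """Route the same writes by day-truncated created_at and by user_id."""
    start = datetime(2024, 1, 1, 9, 0, 0)
    by_day, by_user = Counter(), Counter()
    for i in range(num_writes):
        created_at = start + timedelta(seconds=i % 3600)  # all on one day
        user_id = f"user-{i}"  # hypothetical user IDs
        by_day[shard_for(created_at.date().isoformat())] += 1
        by_user[shard_for(user_id)] += 1
    return by_day, by_user


by_day, by_user = simulate_writes()
print("writes per shard, keyed by day:    ", dict(by_day))
print("writes per shard, keyed by user_id:", dict(by_user))
```

Keyed by day, every write lands on a single shard. Keyed by user_id, the writes spread across all four shards in roughly equal shares.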

Range vs Hash Sharding

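  • Range Sharding: shard 0 holds user IDs 0 to 999,999, shard 1 holds 1,000,000 to 1,999,999, and so on. Range scans are easy because adjacent keys live together, but high-traffic ID ranges become hotspots. For example, if IDs are assigned sequentially, all new users land on the last shard.
  • Hash Sharding: hash(key) % N distributes keys evenly, but adjacent keys are scattered, so a range query has to fan out to every shard.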
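The trade-off between the two strategies can be sketched as a pair of routing functions. This is a simplified illustration, not a production router: the boundaries in `RANGE_BOUNDS`, the function names, and the use of MD5 are all assumptions.

```python
import bisect
import hashlib

# Hypothetical exclusive upper bounds: shard 0 holds IDs below 1,000,000, etc.
RANGE_BOUNDS = [1_000_000, 2_000_000, 3_000_000, 4_000_000]
NUM_SHARDS = len(RANGE_BOUNDS)


def range_shard(user_id: int) -> int:
    """Range sharding: find the shard whose ID range contains user_id."""
    if not 0 <= user_id < RANGE_BOUNDS[-1]:
        raise ValueError(f"user_id {user_id} is outside all shard ranges")
    return bisect.bisect_right(RANGE_BOUNDS, user_id)


def hash_shard(user_id: int) -> int:
    """Hash sharding: hash(key) % N, using a deterministic hash."""
    digest = hashlib.md5(str(user_id).encode()).digest()
    return int.from_bytes(digest[:8], "big") % NUM_SHARDS


def shards_for_range_scan(lo: int, hi: int, strategy: str) -> list[int]:
    """Shards that must be contacted to answer lo <= user_id <= hi."""
    if strategy == "range":
        # Adjacent IDs are stored together, so only a contiguous run of shards.
        return list(range(range_shard(lo), range_shard(hi) + 1))
    # Under hashing, matching IDs can be on any shard, so contact all of them.
    return list(range(NUM_SHARDS))


print(shards_for_range_scan(1_500_000, 1_600_000, "range"))
print(shards_for_range_scan(1_500_000, 1_600_000, "hash"))
```

With range sharding, the scan touches only the shard that owns that ID range. With hash sharding, the same scan has to contact all four shards.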

Cross-Shard Queries

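Queries that touch multiple shards are expensive. A query that does not filter on the shard key must be sent to every shard (scatter-gather) and the results merged in the application layer, so its latency is bounded by the slowest shard. This is why denormalization is common in sharded systems: you store data redundantly so that related rows live on the same shard and cross-shard joins are avoided.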
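A scatter-gather query can be sketched as follows. The in-memory `SHARDS` list, the row fields, and the function names are hypothetical stand-ins for real database connections and queries. Each shard filters and sorts locally, and the application merges the sorted partial results.

```python
import heapq
from itertools import islice

# Hypothetical in-memory stand-ins for three shards of an orders table.
SHARDS = [
    [{"order_id": 1, "total": 50}, {"order_id": 2, "total": 120}],
    [{"order_id": 3, "total": 300}, {"order_id": 4, "total": 20}],
    [{"order_id": 5, "total": 75}, {"order_id": 6, "total": 210}],
]


def query_shard(shard, min_total):
    """Per-shard work: filter and sort locally, as each database would."""
    rows = [row for row in shard if row["total"] >= min_total]
    return sorted(rows, key=lambda row: row["total"], reverse=True)


def top_orders(min_total, limit):
    """Scatter the query to every shard, then merge the sorted partials."""
    partials = [query_shard(shard, min_total) for shard in SHARDS]
    merged = heapq.merge(*partials, key=lambda row: row["total"], reverse=True)
    return list(islice(merged, limit))


for row in top_orders(min_total=60, limit=3):
    print(row)
```

Even in this toy version, the query costs one request per shard plus a merge step. In a real deployment the requests would typically be issued in parallel, and the slowest shard would set the response time.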

Resharding

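When you add shards, you must move data between them. This is one of the hardest operational problems in distributed databases, because the move usually has to happen while the system keeps serving traffic. With hash(key) % N, changing N remaps most keys to a different shard. Consistent hashing reduces the movement required: adding one shard to N existing shards moves only about 1/(N+1) of the keys, all of them onto the new shard.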
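The difference in data movement can be measured with a small experiment. This is a minimal sketch of a consistent-hash ring with virtual nodes. The `HashRing` class, the `shard-N` names, the choice of 100 virtual nodes per shard, and the synthetic keys are all assumptions for illustration.

```python
import bisect
import hashlib


def h(value: str) -> int:
    """Deterministic 64-bit hash derived from MD5 (illustrative choice)."""
    return int.from_bytes(hashlib.md5(value.encode()).digest()[:8], "big")


class HashRing:
    """Consistent-hash ring; each node owns many points (virtual nodes)."""

    def __init__(self, nodes, vnodes=100):
        self._ring = sorted(
            (h(f"{node}#{i}"), node) for node in nodes for i in range(vnodes)
        )
        self._points = [point for point, _ in self._ring]

    def node_for(self, key: str) -> str:
        """Walk clockwise from the key's hash to the next point on the ring."""
        idx = bisect.bisect_right(self._points, h(key)) % len(self._points)
        return self._ring[idx][1]


def moved_fraction(assign_before, assign_after, keys) -> float:
    """Fraction of keys whose shard changes between two assignments."""
    moved = sum(1 for key in keys if assign_before(key) != assign_after(key))
    return moved / len(keys)


keys = [f"user-{i}" for i in range(10_000)]  # hypothetical keys
old_nodes = [f"shard-{i}" for i in range(4)]
new_nodes = old_nodes + ["shard-4"]

# Going from 4 to 5 shards with plain modulo vs. a consistent-hash ring.
modulo_moved = moved_fraction(lambda k: h(k) % 4, lambda k: h(k) % 5, keys)
ring_old, ring_new = HashRing(old_nodes), HashRing(new_nodes)
ring_moved = moved_fraction(ring_old.node_for, ring_new.node_for, keys)

print(f"modulo:          {modulo_moved:.0%} of keys moved")
print(f"consistent hash: {ring_moved:.0%} of keys moved")
```

Going from 4 to 5 shards, modulo routing remaps about four keys in five in expectation. The ring moves roughly one in five, and every moved key goes to the new shard, so existing shards only give data away and never exchange it with each other.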

