Database Scaling: Replication & Sharding

~25 min · Foundations · DDIA §5–6 · Alex Xu Vol 1, Ch 1

Primary Source
DDIA — Chapters 5 (Replication) & 6 (Partitioning)

These two chapters are among the most important in the book for anyone working on distributed systems. Read them after this lesson, Chapter 5 especially: its discussion of replication lag is critical.

The Database is Almost Always Your Bottleneck

Web servers are easy to scale horizontally — they're stateless. Databases are hard to scale because they hold state. At high scale, the DB is almost always where systems hit their limit. This lesson covers the two main techniques: replication (copies) and sharding (splits).

Replication — Copies for Read Scale & Availability

Replication means keeping the same data on multiple nodes. This buys you two things: read scale (spread reads across replicas) and availability (if the primary dies, a replica can be promoted to take over).

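[Diagram: the app server sends writes to the primary DB, which accepts all writes and replicates them to Replica 1 (reads only), Replica 2 (reads only), and Replica 3 (standby / DR). The app server's reads go to the replicas. A warning marker flags replication lag.]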
Leader-follower replication — primary handles writes, replicas serve reads
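
The read/write split in the diagram can be sketched in a few lines. This is a minimal illustration, not a production router: the class name `ReplicatedRouter`, the node names, and the rule "anything starting with SELECT is a read" are all assumptions made for the example.

```python
import itertools


class ReplicatedRouter:
    """Routes writes to the primary and rotates reads across replicas."""

    def __init__(self, primary, replicas):
        self.primary = primary
        # With no replicas configured, reads fall back to the primary.
        self._read_pool = itertools.cycle(replicas or [primary])

    def node_for(self, sql):
        # Naive classification (an assumption of this sketch): only
        # statements that start with SELECT are treated as reads.
        is_read = sql.lstrip().upper().startswith("SELECT")
        return next(self._read_pool) if is_read else self.primary


router = ReplicatedRouter("primary", ["replica-1", "replica-2"])
print(router.node_for("INSERT INTO posts VALUES (1, 'hi')"))  # primary
print(router.node_for("SELECT * FROM posts"))                 # replica-1
print(router.node_for("SELECT * FROM posts"))                 # replica-2
```

Round-robin is only one way to spread reads; it is used here because it is the simplest to follow.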
Replication lag

Replicas are not instantly updated. With asynchronous replication there is always some lag, typically milliseconds to seconds, and it can grow when a replica falls behind under load. A user who has just posted may not see their own post if their next request hits a lagging replica. This is eventual consistency in practice. Mitigations: read-after-write consistency (serve a user's reads from the primary for a short window after that user writes), or always read from the primary for time-sensitive operations.
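
One way to approximate read-after-write consistency is to pin a user's reads to the primary for a short window after they write. The sketch below assumes a hypothetical 5-second window and an in-memory dictionary of last-write times; a real deployment would need a window longer than its observed replication lag and a store shared by all app servers.

```python
import time


class ReadAfterWriteRouter:
    """Sends a user's reads to the primary shortly after that user writes."""

    def __init__(self, pin_seconds=5.0, clock=time.monotonic):
        self.pin_seconds = pin_seconds   # hypothetical window, tune to real lag
        self._clock = clock              # injectable so the logic can be tested
        self._last_write = {}            # user_id -> time of most recent write

    def record_write(self, user_id):
        self._last_write[user_id] = self._clock()

    def target_for_read(self, user_id):
        last = self._last_write.get(user_id)
        if last is not None and self._clock() - last < self.pin_seconds:
            return "primary"   # the user wrote recently: avoid stale replicas
        return "replica"       # otherwise a replica is good enough
```

Only the writer is pinned. Other users still read from replicas and may briefly see stale data, which is usually acceptable.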

Sharding — Splitting for Write Scale

When a single primary DB can't handle your write throughput (or your data doesn't fit on one machine), you shard: split the data across multiple independent database servers, each owning a subset.

[Diagram: the app server sends requests to a shard router, which computes hash(user_id) % 3. Shard 0 holds user_id % 3 == 0 (IDs 0, 3, 6, 9, ...), Shard 1 holds user_id % 3 == 1 (IDs 1, 4, 7, 10, ...), and Shard 2 holds user_id % 3 == 2 (IDs 2, 5, 8, 11, ...). Problems flagged: cross-shard JOINs, celebrity / hot shards, rebalancing is hard, no global transactions.]
Hash-based sharding — the router hashes the key to determine which shard owns the data
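
A hash-based router can be sketched as follows. The function names `stable_hash` and `shard_for` are illustrative, and SHA-256 is just one reasonable choice of hash. The sketch avoids Python's built-in `hash()` because its output for strings changes between processes, which would send the same key to different shards after a restart.

```python
import hashlib

NUM_SHARDS = 3  # matches the three shards in the diagram


def stable_hash(key: str) -> int:
    """Hash that gives the same result in every process and on every host."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def shard_for(user_id, num_shards: int = NUM_SHARDS) -> int:
    """Return the index of the shard that owns this user's data."""
    return stable_hash(str(user_id)) % num_shards


# Every request for the same user lands on the same shard.
assert shard_for(42) == shard_for(42)
```

Unlike the plain `user_id % 3` labels in the diagram, hashing first also spreads keys that are not evenly distributed integers, such as usernames or UUIDs.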

Shard Key — The Most Important Decision

The shard key determines how data is distributed. A bad shard key creates hotspots (one shard receives a disproportionate share of the traffic) and makes cross-shard queries expensive.

Shard Key | Pros | Cons
user_id | Even distribution, user's data co-located | Can't easily query across users
geographic region | Data close to users, compliance | Uneven load (US shard much larger)
timestamp | Natural time-series partitioning | All writes go to "current" shard (hot!)
hash(user_id) | Most even distribution | Range queries require hitting all shards
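
The difference between the timestamp row and the hash(user_id) row is easy to see in a small simulation. The setup is hypothetical: 1,000 events written within a few minutes, monthly time buckets as the timestamp shards, and four hash shards.

```python
import hashlib
from collections import Counter
from datetime import datetime, timedelta


def shard_by_month(ts: datetime) -> str:
    """Timestamp sharding: one shard per calendar month."""
    return ts.strftime("%Y-%m")


def shard_by_hash(user_id: int, num_shards: int = 4) -> int:
    """Hash sharding: spread users across a fixed number of shards."""
    digest = hashlib.sha256(str(user_id).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


# 1,000 events from 1,000 users, one per second starting at noon.
start = datetime(2024, 5, 1, 12, 0, 0)
events = [(uid, start + timedelta(seconds=uid)) for uid in range(1000)]

by_time = Counter(shard_by_month(ts) for _, ts in events)
by_hash = Counter(shard_by_hash(uid) for uid, _ in events)

print(by_time)  # every write lands on the single "2024-05" shard
print(by_hash)  # writes are spread across all four shards
```

The trade-off from the table still applies: the hash layout balances writes, but a query for "all events in May" now has to visit every shard.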

Common Sharding Problems
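
To make the "rebalancing is hard" problem concrete, the sketch below counts how many keys change owner when a fourth shard is added to the `user_id % 3` scheme from the diagram. The helper name `moved_fraction` and the sample of 12,000 sequential IDs are assumptions made for the example.

```python
def moved_fraction(user_ids, old_shards: int, new_shards: int) -> float:
    """Fraction of keys whose owning shard changes after resharding."""
    user_ids = list(user_ids)
    moved = sum(1 for uid in user_ids if uid % old_shards != uid % new_shards)
    return moved / len(user_ids)


# Going from 3 to 4 shards with plain modulo placement.
fraction = moved_fraction(range(12000), old_shards=3, new_shards=4)
print(f"{fraction:.0%} of keys must move")  # 75% of keys must move
```

A key stays put only when its ID gives the same remainder for both shard counts, which happens for 3 out of every 12 consecutive IDs here. Schemes such as consistent hashing are commonly used to reduce how much data moves when shards are added or removed.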

Check Your Understanding

1. Your write throughput has outgrown your primary database. You've already maxed out vertical scaling. What's the next step?
2. You shard on created_at timestamp for an event logging system. What's the problem?
3. A user posts a message, then immediately refreshes their feed. The refresh goes to a read replica. What problem might they see?

🎓 Sharding is one of the most common interview deep dives. Worth exploring next: how large services such as Twitter or Instagram shard their user databases, and how the celebrity hotspot problem is handled in practice.