Database Scaling: Replication & Sharding

~25 min · Foundations · DDIA §5–6 · Alex Xu Vol 1, Ch 1

Ref

Primary Source

DDIA — Chapters 5 (Replication) & 6 (Partitioning)

These two chapters are among the most important in distributed systems. Read them after this lesson. Chapter 5 especially — the discussion of replication lag is critical.

The Database is Almost Always Your Bottleneck

Web servers are easy to scale horizontally — they're stateless. Databases are hard to scale because they hold state. At high scale, the DB is almost always where systems hit their limit. This lesson covers the two main techniques: replication (copies) and sharding (splits).

Replication — Copies for Read Scale & Availability

Replication means keeping the same data on multiple nodes. This buys you two things: read scale (spread reads across replicas) and availability (if the primary dies, a replica takes over).

Leader-follower replication — primary handles writes, replicas serve reads

Replication lag

Replicas are not instantly updated. There's always a lag (milliseconds to seconds). A user who just posted may not see their post if their next request hits a lagged replica. This is eventual consistency in practice. Mitigations: read-after-write consistency (route writes and immediate subsequent reads to the primary), or read from primary for time-sensitive operations.

Sharding — Splitting for Write Scale

When a single primary DB can't handle your write throughput (or your data doesn't fit on one machine), you shard: split the data across multiple independent database servers, each owning a subset.

Hash-based sharding — the router hashes the key to determine which shard owns the data

Shard Key — The Most Important Decision

The shard key determines how data is distributed. A bad shard key creates hotspots (one shard gets all the traffic) and makes cross-shard queries expensive.

Shard Key	Pros	Cons
`user_id`	Even distribution, user's data co-located	Can't easily query across users
`geographic region`	Data close to users, compliance	Uneven load (US shard much larger)
`timestamp`	Natural time-series partitioning	All writes go to "current" shard — hot!
`hash(user_id)`	Most even distribution	Range queries require hitting all shards

Common Sharding Problems

Celebrity / hotspot problem — Beyoncé's user_id shard gets 1000× the traffic. Mitigate: add a random suffix to hot keys, cache aggressively.
Cross-shard JOINs — You can't efficiently JOIN data across shards. Solution: denormalize (duplicate data) or use an aggregation service.
Resharding — When you need to add more shards, you have to redistribute data. Consistent hashing minimizes this disruption (Lesson 09).
No distributed transactions — ACID across shards is extremely hard. Design your data model to avoid cross-shard writes.

Check Your Understanding

1. Your write throughput has outgrown your primary database. You've already maxed out vertical scaling. What's the next step?

2. You shard on created_at timestamp for an event logging system. What's the problem?

3. A user posts a message, then immediately refreshes their feed. The refresh goes to a read replica. What problem might they see?

🎓 Sharding is one of the most common interview deep dives. Ask me to walk through how Twitter or Instagram shards their user database, or how to handle the celebrity hotspot problem in practice.