These two chapters are among the most important in distributed systems. Read them after this lesson. Chapter 5 especially — the discussion of replication lag is critical.
The Database is Almost Always Your Bottleneck
Web servers are easy to scale horizontally — they're stateless. Databases are hard to scale because they hold state. At high scale, the DB is almost always where systems hit their limit. This lesson covers the two main techniques: replication (copies) and sharding (splits).
Replication — Copies for Read Scale & Availability
Replication means keeping the same data on multiple nodes. This buys you two things: read scale (spread reads across replicas) and availability (if the primary dies, a replica takes over).
Replicas are not instantly updated. There's always a lag (milliseconds to seconds). A user who just posted may not see their post if their next request hits a lagged replica. This is eventual consistency in practice. Mitigations: read-after-write consistency (route writes and immediate subsequent reads to the primary), or read from primary for time-sensitive operations.
Sharding — Splitting for Write Scale
When a single primary DB can't handle your write throughput (or your data doesn't fit on one machine), you shard: split the data across multiple independent database servers, each owning a subset.
Hash-based sharding — the router hashes the key to determine which shard owns the data
Shard Key — The Most Important Decision
The shard key determines how data is distributed. A bad shard key creates hotspots (one shard gets all the traffic) and makes cross-shard queries expensive.
Shard Key
Pros
Cons
user_id
Even distribution, user's data co-located
Can't easily query across users
geographic region
Data close to users, compliance
Uneven load (US shard much larger)
timestamp
Natural time-series partitioning
All writes go to "current" shard — hot!
hash(user_id)
Most even distribution
Range queries require hitting all shards
Common Sharding Problems
Celebrity / hotspot problem — Beyoncé's user_id shard gets 1000× the traffic. Mitigate: add a random suffix to hot keys, cache aggressively.
Cross-shard JOINs — You can't efficiently JOIN data across shards. Solution: denormalize (duplicate data) or use an aggregation service.
Resharding — When you need to add more shards, you have to redistribute data. Consistent hashing minimizes this disruption (Lesson 09).
No distributed transactions — ACID across shards is extremely hard. Design your data model to avoid cross-shard writes.
Check Your Understanding
1. Your write throughput has outgrown your primary database. You've already maxed out vertical scaling. What's the next step?
2. You shard on created_at timestamp for an event logging system. What's the problem?
3. A user posts a message, then immediately refreshes their feed. The refresh goes to a read replica. What problem might they see?
🎓 Sharding is one of the most common interview deep dives. Ask me to walk through how Twitter or Instagram shards their user database, or how to handle the celebrity hotspot problem in practice.