Alex Xu Vol 1, Chapter 12 — "Design a Chat System"
Covers building stateful WebSocket architectures, managing user presence, and selecting high-throughput message storage layers.
Core Requirements
Designing a messaging service (WhatsApp, Messenger, or Slack) requires solving both stateful connection routing and high-volume message storage:
Functional: 1-on-1 chat, group chat (up to 100 people), online/offline presence status, read receipts.
Non-Functional: Ultra-low latency message delivery, high reliability (no lost messages), support for offline users.
Choosing the Right Protocol
In standard web applications, clients send HTTP requests and servers respond. In a chat application, the server must also be able to **push** messages to the client at any time. We compare two approaches:
HTTP Long Polling
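Client requests data and the server keeps the connection open until a new message arrives or a timeout occurs.
✗ Overhead: after every response or timeout the client must issue a new HTTP request, resending headers each time (and often opening a new TCP connection).
✗ Inefficient for active chats: every delivered message costs a full request/response cycle.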
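The hold-until-message-or-timeout behaviour can be sketched in a few lines. This is a minimal in-process illustration, not a real HTTP server: the `long_poll` function and the `mailbox` queue are hypothetical names, and a `queue.Queue` stands in for the server-side wait.

```python
import queue
import threading

def long_poll(mailbox: queue.Queue, timeout_s: float):
    """Block until a message arrives or the timeout expires (one 'request')."""
    try:
        return mailbox.get(timeout=timeout_s)
    except queue.Empty:
        return None  # timed out: a real client would immediately poll again

mailbox = queue.Queue()

# A sender delivers a message 50 ms after the client starts waiting.
threading.Timer(0.05, mailbox.put, args=["hello"]).start()

first = long_poll(mailbox, timeout_s=1.0)   # returns as soon as the message lands
second = long_poll(mailbox, timeout_s=0.1)  # nothing arrives, so this one times out
```

Every call to `long_poll` corresponds to a fresh request, which is where the repeated header overhead comes from.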
WebSockets (Preferred)
Connection starts as standard HTTP and "upgrades" to a persistent, bidirectional, low-overhead TCP socket.
✓ Real-time, bidirectional.
✓ Low overhead once established.
✗ Stateful: requires chat servers to hold open connections in memory.
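To make the "upgrade" step concrete, the sketch below computes the `Sec-WebSocket-Accept` header a server returns during the opening handshake, following RFC 6455 (SHA-1 of the client's key concatenated with a fixed GUID, then Base64). The function names are illustrative, and a real server would rely on a WebSocket library rather than build the response by hand.

```python
import base64
import hashlib

# Fixed GUID defined by RFC 6455 for the opening handshake.
WS_GUID = "258EAFA5-E914-47DA-95CA-C5AB0DC85B11"

def websocket_accept(sec_websocket_key: str) -> str:
    """Derive Sec-WebSocket-Accept from the client's Sec-WebSocket-Key."""
    digest = hashlib.sha1((sec_websocket_key + WS_GUID).encode("ascii")).digest()
    return base64.b64encode(digest).decode("ascii")

def upgrade_response(sec_websocket_key: str) -> str:
    """Build the 101 response that switches the connection from HTTP to WebSocket."""
    return (
        "HTTP/1.1 101 Switching Protocols\r\n"
        "Upgrade: websocket\r\n"
        "Connection: Upgrade\r\n"
        f"Sec-WebSocket-Accept: {websocket_accept(sec_websocket_key)}\r\n"
        "\r\n"
    )

# Sample key taken from RFC 6455, section 1.3.
accept = websocket_accept("dGhlIHNhbXBsZSBub25jZQ==")
```

After this single exchange the same TCP connection carries WebSocket frames in both directions, with no further HTTP headers.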
System Architecture
A key difference between a chat system and a typical web app is that **chat servers are stateful**. They maintain active, long-lived WebSocket connections to clients, so a message for a given user must reach the specific server that holds that user's connection. We use a **Service Registry** (like ZooKeeper) to register the available chat servers and keep track of which client is connected to which chat server.
Chat System Architecture: Chat servers manage persistent WebSocket connections while a central service handles lookup routing and presence
Message Flow (Client A to Client B):
1. Client A sends a message to Chat Server 1 over its WebSocket.
2. Chat Server 1 receives the message, assigns it a time-sortable ID, and writes it to the message queue for storage.
3. Chat Server 1 calls the **Routing/Presence service** to locate Client B.
4. If Client B is online, the Routing/Presence service tells Chat Server 1 that Client B is connected to Chat Server 2.
5. Chat Server 1 forwards the message to Chat Server 2, which pushes it to Client B over Client B's active WebSocket.
6. If Client B is offline, nothing is pushed; the message stays in the database (it was persisted in step 2) until Client B reconnects and pulls it.
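The flow above can be sketched as a small in-memory simulation. Everything here is an assumption made for illustration: `ChatServer`, `registry`, `store`, and `next_message_id` are hypothetical names, Python lists stand in for WebSocket connections, a dict stands in for the Routing/Presence lookup, and the message queue and database are collapsed into one `store`. The ID layout (millisecond timestamp in the high bits, counter in the low 22 bits) is one common way to get time-sortable IDs, not necessarily the one the chapter uses.

```python
import itertools
import time
from collections import defaultdict

_sequence = itertools.count()

def next_message_id() -> int:
    """Time-sortable ID: millisecond timestamp in the high bits, counter in the low 22."""
    return (int(time.time() * 1000) << 22) | (next(_sequence) & 0x3FFFFF)

class ChatServer:
    def __init__(self, name: str):
        self.name = name
        self.sockets = {}  # user_id -> list standing in for an open WebSocket

    def connect(self, user_id: str, registry: dict) -> None:
        self.sockets[user_id] = []
        registry[user_id] = self  # record which server holds this user's connection

    def push(self, user_id: str, message: dict) -> None:
        self.sockets[user_id].append(message)

    def handle_send(self, sender, recipient, text, registry, store) -> dict:
        message = {"id": next_message_id(), "from": sender, "to": recipient, "text": text}
        store[recipient].append(message)   # persist first, so nothing is lost
        target = registry.get(recipient)   # ask where the recipient is connected
        if target is not None:
            target.push(recipient, message)  # online: forward to that server, which pushes
        return message                       # offline: message waits in `store`

registry, store = {}, defaultdict(list)
server1, server2 = ChatServer("chat-1"), ChatServer("chat-2")
server1.connect("alice", registry)
server2.connect("bob", registry)

m1 = server1.handle_send("alice", "bob", "hi bob", registry, store)
m2 = server1.handle_send("alice", "carol", "hi carol", registry, store)  # carol is offline
```

Because the message is persisted before the lookup, an offline recipient needs no special handling: the message is simply pulled from storage on the next connect.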
Presence Status (Heartbeat Check)
How does the server know if a user is online or offline? Mobile connections drop briefly all the time (e.g., when the user enters a tunnel), so marking a user offline the instant their WebSocket disconnects produces false offline indicators.
Instead, we use a **Heartbeat Mechanism**:
Every 5 seconds, the client sends a small heartbeat ping to the Presence server.
On each heartbeat, the Presence server refreshes the user's presence entry in Redis, resetting its TTL (Time-To-Live) to 15 seconds.
If no heartbeat arrives within 15 seconds (roughly three heartbeat intervals), the entry expires and the user's status is automatically set to Offline.
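A minimal sketch of the heartbeat logic, assuming the 15-second window described above. `PresenceTracker` is a hypothetical in-memory stand-in for the Redis-backed cache; in Redis the same effect would typically come from writing the user's key with an expiry (for example `SET` with the `EX` option) on every heartbeat. The clock is injectable so the behaviour can be checked without waiting.

```python
import time

class PresenceTracker:
    """In-memory stand-in for a TTL cache: a user is online until their entry expires."""

    def __init__(self, ttl_s: float = 15.0, clock=time.monotonic):
        self.ttl_s = ttl_s
        self.clock = clock
        self._expires_at = {}  # user_id -> time at which the entry expires

    def heartbeat(self, user_id: str) -> None:
        # Each heartbeat pushes the expiry ttl_s seconds into the future.
        self._expires_at[user_id] = self.clock() + self.ttl_s

    def is_online(self, user_id: str) -> bool:
        return self._expires_at.get(user_id, float("-inf")) > self.clock()

# Simulated clock so the example runs instantly.
now = [0.0]
tracker = PresenceTracker(ttl_s=15.0, clock=lambda: now[0])

tracker.heartbeat("alice")                   # heartbeat at t = 0
now[0] = 12.0
during_short_drop = tracker.is_online("alice")   # two heartbeats missed, still within the TTL
now[0] = 16.0
after_long_drop = tracker.is_online("alice")     # TTL elapsed, entry has expired
tracker.heartbeat("alice")                   # connection restored
after_reconnect = tracker.is_online("alice")
```

A brief drop that costs one or two heartbeats never changes the visible status; only a sustained silence does.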
Choosing the Message Database
Chat systems have a very specific database access pattern:
• The ratio of reads to writes is close to 1:1.
• Query pattern is: retrieve the N most recent messages for a conversation (chronological sequence range queries).
• Relational databases handle this poorly at scale: once a table holds trillions of messages, its indexes grow so large that scans and random access become expensive.
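Solution: Wide-column stores like **Cassandra** or **HBase**. In Cassandra, rows are distributed across nodes by partition key (e.g. channel_id) and stored within each partition sorted by clustering key (e.g. message_id), so reading the N most recent messages of a conversation is a sequential read from a single partition. The log-structured storage these systems use also turns writes into cheap appends, giving extremely high write throughput.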
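The access pattern can be illustrated with a toy in-memory model. This is only a sketch of the data layout, not of Cassandra or HBase themselves: `MessageTable` is a hypothetical class in which the dict key plays the role of the partition key (`channel_id`) and each list is kept sorted by the clustering key (`message_id`).

```python
import bisect
from collections import defaultdict

class MessageTable:
    """Toy wide-column layout: one sorted run of rows per partition."""

    def __init__(self):
        # channel_id -> list of (message_id, body), kept sorted by message_id
        self._partitions = defaultdict(list)

    def insert(self, channel_id: str, message_id: int, body: str) -> None:
        bisect.insort(self._partitions[channel_id], (message_id, body))

    def latest(self, channel_id: str, n: int) -> list:
        """Return the n most recent messages of one channel, newest first."""
        if n <= 0:
            return []
        rows = self._partitions.get(channel_id, [])
        return rows[-n:][::-1]  # contiguous slice at the end of the sorted run

table = MessageTable()
table.insert("c1", 3, "third")
table.insert("c1", 1, "first")    # arrives out of order, still stored sorted
table.insert("c1", 2, "second")
table.insert("c2", 9, "other channel")

recent = table.latest("c1", 2)
```

The query touches one partition and reads a contiguous slice, which is why time-sortable message IDs matter: they make "most recent N" a range read instead of a sort.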
Check Your Understanding
1. Why is WebSocket preferred over HTTP Long Polling for messaging applications?
2. How does a presence server prevent marking a user offline every time their mobile connection briefly drops?
3. Why is Cassandra preferred over MySQL for storing chat history?