Design a Chat System (WhatsApp / Slack)

~25 min · Case Studies · Alex Xu Vol 1, Ch 12

Primary Source: Alex Xu, Vol 1, Chapter 12, "Design a Chat System"

Covers building stateful WebSocket architectures, managing user presence, and selecting high-throughput message storage layers.

Core Requirements

Designing a messaging service (such as WhatsApp, Messenger, or Slack) means solving two problems: routing messages across stateful connections and storing messages at high volume.

Choosing the Right Protocol

In standard web applications, clients send HTTP requests and servers respond. In a chat application, the server must also be able to **push** messages to the client at any time. The main approaches are compared below:

HTTP Long Polling
Client requests data and the server keeps the connection open until a new message arrives or a timeout occurs.

✗ Overhead: after every response or timeout the client must issue a new HTTP request, resending full headers each time (and possibly opening a new TCP connection).
✗ Inefficient for active chats.

WebSockets (Preferred)
Connection starts as a standard HTTP request and "upgrades" to a persistent, bidirectional, low-overhead connection over the same TCP socket.

✓ Real-time, bidirectional.
✓ Low overhead once established.
✗ Stateful: chat servers must hold every open connection in memory.
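
To make the "upgrade" step concrete, here is a minimal sketch of the server side of the WebSocket opening handshake. The header names and the fixed GUID come from RFC 6455; the function names `websocket_accept` and `upgrade_response` are hypothetical, and a real server would also validate the HTTP version, `Connection` header, and `Sec-WebSocket-Version`.

```python
import base64
import hashlib

# Fixed GUID defined by RFC 6455 for the opening handshake.
WEBSOCKET_GUID = "258EAFA5-E914-47DA-95CA-C5AB0DC85B11"


def websocket_accept(sec_websocket_key: str) -> str:
    """Compute Sec-WebSocket-Accept from the client's Sec-WebSocket-Key."""
    raw = (sec_websocket_key + WEBSOCKET_GUID).encode("ascii")
    return base64.b64encode(hashlib.sha1(raw).digest()).decode("ascii")


def upgrade_response(request_headers: dict) -> str:
    """Build the 101 response that switches the connection from HTTP to WebSocket."""
    if request_headers.get("Upgrade", "").lower() != "websocket":
        raise ValueError("not a WebSocket upgrade request")
    accept = websocket_accept(request_headers["Sec-WebSocket-Key"])
    return (
        "HTTP/1.1 101 Switching Protocols\r\n"
        "Upgrade: websocket\r\n"
        "Connection: Upgrade\r\n"
        f"Sec-WebSocket-Accept: {accept}\r\n"
        "\r\n"
    )
```

After this single exchange, both sides keep the same TCP socket open and exchange lightweight frames, which is why the per-message overhead is much lower than re-sending HTTP headers on every long-poll cycle.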

System Architecture

A key difference between a chat system and a typical web app is that **chat servers are stateful**. They maintain active, long-lived WebSocket connections to clients. We use a **Service Registry** (like ZooKeeper) to keep track of which client is connected to which chat server.

[Figure: Client A, Client B, Chat Server 1 (WS), Chat Server 2 (WS), Routing / Presence service, Cassandra DB (Chat History)]
Chat System Architecture: Chat servers manage persistent WebSocket connections while a central service handles lookup routing and presence.

Message Flow (Client A to Client B):

  1. Client A sends a message to WebSocket Chat Server 1.
  2. Chat Server 1 gets the message, allocates a time-sortable ID, and writes it to the message queue for storage.
  3. Chat Server 1 calls the **Routing/Presence service** to locate Client B.
  4. If Client B is online, the Presence service tells Chat Server 1 that Client B is connected to Chat Server 2.
  5. Chat Server 1 forwards the message to Chat Server 2, which pushes the message to Client B over its active WebSocket.
  6. If Client B is offline, nothing is pushed: the message, already persisted in step 2, stays in the database until Client B reconnects and pulls it.
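
The six steps above can be sketched as a small in-memory simulation. Everything here is a hypothetical stand-in: `PresenceRegistry` plays the Routing/Presence service, a plain list replaces the message queue plus database, `delivered` replaces pushes over live WebSockets, and the ID layout (millisecond timestamp in the high bits, counter in the low 20 bits) is just one assumed way to get time-sortable IDs.

```python
import itertools
import time
from dataclasses import dataclass


@dataclass
class Message:
    message_id: int
    sender: str
    recipient: str
    body: str


class IdGenerator:
    """Assumed layout: millisecond timestamp << 20 | per-process counter."""

    def __init__(self):
        self._counter = itertools.count()

    def next_id(self) -> int:
        millis = time.time_ns() // 1_000_000
        return (millis << 20) | (next(self._counter) & 0xFFFFF)


class PresenceRegistry:
    """Maps each online user to the chat server holding their connection."""

    def __init__(self):
        self._location = {}

    def connect(self, user, server):
        self._location[user] = server

    def disconnect(self, user):
        self._location.pop(user, None)

    def locate(self, user):
        return self._location.get(user)  # None means offline


class ChatServer:
    def __init__(self, name, registry, store, ids):
        self.name, self.registry, self.store, self.ids = name, registry, store, ids
        self.delivered = []  # stands in for pushes over live WebSockets

    def accept_connection(self, user):
        self.registry.connect(user, self)

    def handle_send(self, sender, recipient, body):
        msg = Message(self.ids.next_id(), sender, recipient, body)  # step 2: assign ID
        self.store.append(msg)                                      # step 2: persist
        target = self.registry.locate(recipient)                    # steps 3-4: lookup
        if target is not None:
            target.push(msg)                                        # step 5: forward
        return msg                                                  # step 6: else stays stored

    def push(self, msg):
        self.delivered.append((msg.recipient, msg))


def pending_for(store, user):
    """Messages a reconnecting user would pull from storage."""
    return [m for m in store if m.recipient == user]
```

Note that the message is persisted before the lookup, so delivery to an offline user needs no special write path: the recipient simply reads from storage on reconnect.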

Presence Status (Heartbeat Check)

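How does the server know if a user is online or offline? Marking a user offline the instant their WebSocket disconnects produces false offline indicators whenever the connection drops briefly (e.g., the user enters a tunnel).
Instead, we use a **Heartbeat Mechanism**: the client sends a small heartbeat event at a fixed interval, and the server marks the user offline only after no heartbeat has arrived within a timeout window.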
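
A minimal sketch of a heartbeat check, assuming the server records the time of each client's last heartbeat and treats the user as online until a timeout elapses. The class name `PresenceTracker`, the 30-second default timeout, and the injectable `clock` parameter are illustrative assumptions, not values from the source.

```python
import time


class PresenceTracker:
    """Tracks the last heartbeat per user; a user is online until the timeout lapses."""

    def __init__(self, timeout_seconds=30.0, clock=time.monotonic):
        self.timeout = timeout_seconds  # assumed value; tune per product
        self.clock = clock              # injectable so tests can control time
        self._last_seen = {}

    def heartbeat(self, user_id):
        # Called whenever a heartbeat event arrives from the client.
        self._last_seen[user_id] = self.clock()

    def is_online(self, user_id):
        last = self._last_seen.get(user_id)
        return last is not None and self.clock() - last <= self.timeout
```

With a client interval well below the timeout (for example a heartbeat every few seconds against a 30-second window), a short network drop costs only a few missed heartbeats and the user's status never changes.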

Choosing the Message Database

Chat systems have a very specific database access pattern:
• The ratio of reads to writes is close to 1:1.
• The dominant query is: retrieve the N most recent messages for a conversation (a range query in chronological order).
• Relational databases handle this poorly at scale: when a table grows to trillions of messages, its indexes no longer fit in memory and lookups become expensive random disk reads.
Solution: Wide-column stores like **Cassandra** or **HBase**. Rows are grouped by partition key (e.g. channel_id) and stored sorted by clustering key (e.g. message_id) within each partition, so fetching the latest N messages of a conversation is a fast sequential read. Their log-structured (LSM-tree) write path also supports extremely high write throughput.
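
To illustrate why this layout suits the "latest N messages" query, here is a toy in-memory model of a table keyed by partition key `channel_id` and clustering key `message_id`. It is only a sketch of the access pattern, not how Cassandra or HBase are implemented; the class `MessageStore` and its methods are hypothetical.

```python
import bisect
from collections import defaultdict


class MessageStore:
    """Toy model: one sorted run of (message_id, body) rows per channel_id."""

    def __init__(self):
        self._partitions = defaultdict(list)

    def insert(self, channel_id, message_id, body):
        # Rows stay sorted by the clustering key within their partition.
        bisect.insort(self._partitions[channel_id], (message_id, body))

    def latest(self, channel_id, n, before=None):
        """Return up to n newest rows (newest first), optionally older than `before`."""
        rows = self._partitions.get(channel_id, [])
        # (before,) sorts ahead of any (before, body), so this finds the first id >= before.
        end = len(rows) if before is None else bisect.bisect_left(rows, (before,))
        start = max(end - n, 0)
        return rows[start:end][::-1]


store = MessageStore()
for message_id, body in [(3, "c"), (1, "a"), (2, "b")]:
    store.insert("general", message_id, body)
print(store.latest("general", 2))            # newest two messages
print(store.latest("general", 2, before=3))  # next page, older than id 3
```

Because message IDs are time-sortable, the newest rows sit contiguously at the end of each partition, and paging backwards through history is a slice rather than a scan over a global index.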

Check Your Understanding

1. Why is WebSocket preferred over HTTP Long Polling for messaging applications?
2. How does a presence server prevent marking a user offline every time their mobile connection briefly drops?
3. Why is Cassandra preferred over MySQL for storing chat history?