
Design a Notification System

~20 min · Case Studies · Alex Xu Vol 1, Ch 10

Primary Source
Alex Xu Vol 1, Chapter 10 — "Design a Notification System"

Covers asynchronous notification pipelines, third-party push integrations (APNs, FCM, Twilio), and achieving reliability guarantees.

Understanding the Scope

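A notification system sends critical alerts to users across multiple channels. In an interview, start by establishing which channels must be supported; the design below covers push notifications (APNs / FCM), SMS, and email.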

System Architecture

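A synchronous design (sending emails/pushes directly from the API thread) stalls or fails whenever a third-party provider is slow or down, because every API request blocks on the provider call. The system must instead be asynchronous, using **Message Queues** to decouple the API from delivery and pools of workers that scale independently per channel.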

[Figure: API Servers and User Settings DB (verify templates / opt-in) → Message Queues (SMS Queue, Push Queue, Email Queue) → Workers Pool → APNs / FCM, Twilio (SMS), SendGrid (Email)]
Notification System Architecture: Multiple channel queues decouple request spikes from delivery workers
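
To make the decoupling concrete, here is a minimal single-process sketch using Python's standard `queue` and `threading` modules. The in-memory queues stand in for a real broker, and `enqueue_notification`, `fake_send`, and `run_demo` are hypothetical names chosen for illustration; a production system would replace `fake_send` with calls to the actual providers.

```python
import queue
import threading

CHANNELS = ("sms", "push", "email")

# One in-memory queue per channel; a real deployment would use a durable broker.
queues = {channel: queue.Queue() for channel in CHANNELS}
delivered = []  # (channel, user_id, body) tuples, kept only for inspection
delivered_lock = threading.Lock()


def enqueue_notification(channel, user_id, body):
    """Runs on the API thread: validate the channel, enqueue, return at once."""
    if channel not in queues:
        raise ValueError(f"unknown channel: {channel}")
    queues[channel].put({"user_id": user_id, "body": body})


def fake_send(channel, message):
    """Hypothetical stand-in for a provider call (APNs/FCM, Twilio, SendGrid)."""
    with delivered_lock:
        delivered.append((channel, message["user_id"], message["body"]))


def worker(channel):
    """Drain one channel queue until the None shutdown sentinel arrives."""
    q = queues[channel]
    while True:
        message = q.get()
        try:
            if message is None:
                return
            fake_send(channel, message)
        finally:
            q.task_done()


def run_demo():
    delivered.clear()
    threads = [threading.Thread(target=worker, args=(c,)) for c in CHANNELS]
    for t in threads:
        t.start()
    enqueue_notification("email", "u1", "Your receipt")
    enqueue_notification("sms", "u1", "Your code is 123456")
    enqueue_notification("push", "u2", "New login detected")
    for c in CHANNELS:
        queues[c].put(None)  # ask each worker to stop after draining
    for t in threads:
        t.join()
    return sorted(delivered)
```

Because each channel has its own queue and worker, a slow SMS provider in this sketch would delay only SMS messages; email and push would continue to drain.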

Three Key Production Challenges

1. Guaranteeing Delivery (At-Least-Once)

Notifications must not be lost. We achieve this by:
• Storing a persistent log of all notification statuses (e.g. Sent, Failed, Retrying) in a database.
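• Using a message broker that persists messages to disk (for example Kafka, or RabbitMQ with durable queues and persistent messages), so queued notifications survive a broker restart.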
• Implementing **retry queues**. If a SendGrid request fails with a 503, workers put the notification back into a retry queue with exponential backoff.
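
The retry and status-tracking ideas above can be sketched as follows. This is an illustrative simplification, not a definitive implementation: it retries in-process instead of re-enqueuing onto a delayed retry queue, `status_log` is a plain dict standing in for the status table, and the constants, the set of retryable status codes, and the use of random jitter are all assumptions.

```python
import random

MAX_ATTEMPTS = 5      # assumed cap on delivery attempts
BASE_DELAY_S = 1.0    # assumed first retry delay
MAX_DELAY_S = 60.0    # assumed ceiling on any single delay

RETRYABLE_STATUS = {429, 500, 502, 503, 504}  # assumed transient codes


def backoff_delay(attempt, jitter=True):
    """Delay after failed attempt `attempt` (1-based): base * 2^(attempt-1), capped."""
    delay = min(MAX_DELAY_S, BASE_DELAY_S * 2 ** (attempt - 1))
    if jitter:
        # Randomize so many workers do not retry at the same instant.
        delay = random.uniform(0, delay)
    return delay


def deliver_with_retries(notification_id, send, status_log, sleep):
    """Call `send()` until success, a permanent error, or attempts run out.

    `send` returns an HTTP-like status code. `sleep` is injected so callers
    (and tests) control how waiting happens.
    """
    for attempt in range(1, MAX_ATTEMPTS + 1):
        status = send()
        if 200 <= status < 300:
            status_log[notification_id] = "Sent"
            return True
        if status not in RETRYABLE_STATUS or attempt == MAX_ATTEMPTS:
            status_log[notification_id] = "Failed"
            return False
        status_log[notification_id] = "Retrying"
        sleep(backoff_delay(attempt))
    return False
```

In a queue-based deployment, the `sleep` call would typically be replaced by publishing the message to a retry queue with the computed delay, so the worker is free to process other notifications while it waits.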

2. Preventing Duplication (Idempotency)

While we want at-least-once delivery, sending the same payment alert twice is a terrible user experience.
Deduplication Strategy: When an event occurs, generate an idempotency key (e.g., transaction_id + event_type). Before sending, a worker atomically claims the key in Redis, for example with SET using the NX option and an expiry, so that the existence check and the write happen in a single step and two workers cannot both pass the check. If the key already exists, the worker discards the duplicate request.
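
A minimal sketch of this check follows. To stay self-contained it uses an in-memory class, `InMemoryKeyStore`, as a hypothetical stand-in for Redis; its `set_if_absent` method imitates the semantics of `SET key value NX EX ttl`. The key format, the 24-hour default TTL, and the function names are assumptions for illustration.

```python
import time


class InMemoryKeyStore:
    """Stand-in for Redis; set_if_absent mimics `SET key value NX EX ttl`.

    Not thread-safe: Redis performs the equivalent operation atomically.
    """

    def __init__(self, clock=time.monotonic):
        self._expiry = {}
        self._clock = clock

    def set_if_absent(self, key, ttl_seconds):
        now = self._clock()
        expires_at = self._expiry.get(key)
        if expires_at is not None and expires_at > now:
            return False  # key is still live: already claimed
        self._expiry[key] = now + ttl_seconds
        return True


def idempotency_key(transaction_id, event_type):
    # Assumed key layout; any stable combination of the two fields works.
    return f"notif:{transaction_id}:{event_type}"


def send_once(store, transaction_id, event_type, send, ttl_seconds=24 * 3600):
    """Send only if this (transaction, event) pair has not been claimed recently."""
    key = idempotency_key(transaction_id, event_type)
    if not store.set_if_absent(key, ttl_seconds):
        return False  # duplicate: discard without sending
    send()
    return True
```

One design choice to flag: this sketch claims the key before sending, so a crash between the claim and the send would suppress the notification until the key expires. Whether to release the key on failure is a trade-off between duplicates and missed alerts.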

3. User Control & Rate Limiting

Users shouldn't receive 50 spam notifications an hour.
• **Opt-in/Preference Check:** API servers look up user settings first to verify the user hasn't disabled the channel (e.g., marketing email = disabled).
• **User-level rate limiter:** Limit the number of marketing pushes to e.g. 3 per day per user, dropping any excess notifications.
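
The two checks above can be combined into a single gate, sketched below. The `user_settings` dict is a hypothetical stand-in for the User Settings DB, the limiter is a simple fixed-window daily counter held in memory, and treating a missing setting as opted out is an assumption that real products may choose differently.

```python
from collections import defaultdict
from datetime import date

MARKETING_PUSH_DAILY_LIMIT = 3  # the example limit from the text

# Hypothetical settings table: (user_id, channel, category) -> enabled?
user_settings = {
    ("u1", "email", "marketing"): False,
    ("u1", "push", "marketing"): True,
}


class DailyRateLimiter:
    """Fixed-window counter keyed by (user, channel, category, day)."""

    def __init__(self, limit):
        self.limit = limit
        self._counts = defaultdict(int)

    def allow(self, user_id, channel, category, today):
        key = (user_id, channel, category, today)
        if self._counts[key] >= self.limit:
            return False  # over the daily budget: drop
        self._counts[key] += 1
        return True


def should_send(limiter, user_id, channel, category, today=None):
    """Return True only if the user opted in and is under the daily limit."""
    today = today or date.today()
    # Assumption: a missing setting means the user has not opted in.
    if not user_settings.get((user_id, channel, category), False):
        return False
    return limiter.allow(user_id, channel, category, today)
```

Checking preferences before the limiter means opted-out users never consume rate-limit budget. In a multi-server deployment the counters would need to live in shared storage, not in process memory.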

Check Your Understanding

1. In a high-volume notification system, why are channel-specific message queues (SMS queue, push queue, email queue) used?
2. How can you prevent a user from receiving a duplicate transaction alert if the notification service is retried due to network issues?
3. What is the standard SRE approach if a third-party SMS gateway returns a transient HTTP 500 error code?