Day 25 — Design WhatsApp

The Scale Problem

58,000 Messages Per Second

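2B users and 5B messages/day average out to about 58K messages/second (5,000,000,000 ÷ 86,400 s ≈ 57,870), and peak traffic will run higher. Every message must reach the recipient's device exactly once, even when the recipient is offline: no duplicates, no drops, no data loss.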

🔌

Persistent Connections

WebSocket (not HTTP) — server pushes messages without polling. Each server holds ~100K long-lived connections simultaneously.
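The figures quoted so far can be checked with a few lines of arithmetic. This is a rough sketch: the server count assumes every user is connected at the same moment, which is an upper bound chosen for illustration rather than a measured number.

```python
# Back-of-envelope capacity check using the figures quoted above.
SECONDS_PER_DAY = 24 * 60 * 60          # 86,400

messages_per_day = 5_000_000_000        # 5B messages/day (from the text)
total_users = 2_000_000_000             # 2B users (from the text)
connections_per_server = 100_000        # ~100K sockets per server (from the text)

avg_messages_per_second = messages_per_day / SECONDS_PER_DAY

# Assumption: every user is connected at once. Real concurrency is lower,
# so this is an upper bound on the number of connection servers.
max_connection_servers = total_users // connections_per_server

print(f"{avg_messages_per_second:,.0f} msgs/sec on average")
print(f"{max_connection_servers:,} connection servers at most")
```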

🔁

Exactly-Once Delivery

Retries make delivery at-least-once at the network level; deduplicating on message_id turns that into exactly-once from the user's point of view. ON CONFLICT DO NOTHING on insert keeps writes idempotent across retries.

📦

Offline Queue

Messages queued server-side when recipient is offline, flushed on reconnect in FIFO order. Push notification (APNs/FCM) wakes the app.
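A minimal in-memory sketch of the offline queue, assuming one FIFO queue per recipient. The class and method names are hypothetical, and a production system would keep the queue in durable storage and remove entries only after the device acknowledges them.

```python
from collections import defaultdict, deque


class OfflineQueue:
    """Per-recipient FIFO queue of message IDs (in-memory stand-in)."""

    def __init__(self):
        self._queues = defaultdict(deque)

    def enqueue(self, recipient_id, msg_id):
        # Append at the tail so the oldest message stays at the head.
        self._queues[recipient_id].append(msg_id)

    def flush(self, recipient_id):
        # On reconnect, hand back everything in arrival order and clear the queue.
        queue = self._queues.pop(recipient_id, deque())
        return list(queue)


offline = OfflineQueue()
for msg_id in ("m1", "m2", "m3"):
    offline.enqueue("bob", msg_id)

flushed = offline.flush("bob")       # messages in the order they arrived
flushed_again = offline.flush("bob")  # nothing left after the first flush
```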

🔐

E2E Encryption

Signal protocol: messages are encrypted on the device, and the server sees only ciphertext. Even with a court order, WhatsApp cannot read your messages.

ℹ️

Why WebSocket, Not HTTP?

HTTP is request-response: client asks, server answers. For messaging you need server-initiated push. WebSocket upgrades an HTTP connection to a full-duplex channel over the same TCP connection, so the server can push a message at any time without the client polling. At 58K msgs/sec, polling would generate massive unnecessary traffic, because the vast majority of poll requests return nothing.

WebSocket State Machine

The client moves through four connection states and recovers from network drops by reconnecting with exponential backoff.

Connection State Machine

DISCONNECTED → CONNECTING → CONNECTED ⇄ RECONNECTING

Initial state: DISCONNECTED. There is no active WebSocket, and the client attempts a connection on its next action.
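The four connection states and the exponential backoff can be sketched as a small state machine. Only the DISCONNECTED → CONNECTING → CONNECTED ⇄ RECONNECTING path comes from the text; the extra transitions back to DISCONNECTED, the 1-second base delay, and the 30-second cap are assumptions made for illustration.

```python
from enum import Enum


class State(Enum):
    DISCONNECTED = "DISCONNECTED"
    CONNECTING = "CONNECTING"
    CONNECTED = "CONNECTED"
    RECONNECTING = "RECONNECTING"


# Allowed transitions. Falling back to DISCONNECTED is an assumption.
ALLOWED = {
    State.DISCONNECTED: {State.CONNECTING},
    State.CONNECTING: {State.CONNECTED, State.DISCONNECTED},
    State.CONNECTED: {State.RECONNECTING, State.DISCONNECTED},
    State.RECONNECTING: {State.CONNECTED, State.DISCONNECTED},
}


def backoff_delay(attempt, base=1.0, cap=30.0):
    """Delay before retry number `attempt` (0-based): base * 2^attempt, capped."""
    return min(cap, base * (2 ** attempt))


class ConnectionStateMachine:
    def __init__(self):
        self.state = State.DISCONNECTED
        self.attempt = 0

    def transition(self, new_state):
        if new_state not in ALLOWED[self.state]:
            raise ValueError(f"illegal transition {self.state.name} -> {new_state.name}")
        if new_state is State.CONNECTED:
            self.attempt = 0  # a successful connection resets the backoff
        self.state = new_state

    def next_retry_delay(self):
        delay = backoff_delay(self.attempt)
        self.attempt += 1
        return delay
```

A real client would usually add random jitter to each delay so that many clients dropped by the same outage do not all reconnect at the same instant.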

Message Delivery

From Sender to Blue Ticks

Every WhatsApp message travels through a precise pipeline. Each tick (checkmark) represents a discrete application-layer acknowledgment — not a TCP guarantee.

Sender Device (encrypts msg) → WebSocket → Chat Server (stores ciphertext) → Redis check → Dedup Layer (msg_id unique?) → Recipient Server (looks up connection) → WebSocket push → Recipient Device (decrypts locally)

State	Icon	When	Server Action
SENT	✓ (grey)	Sender device → server acknowledged	Store in DB, forward to recipient server
DELIVERED	✓✓ (grey)	Server → recipient device ACK received	Update delivered_at timestamp in DB
READ	✓✓ (blue)	Recipient opens chat, app sends read receipt	Update read_at, push receipt to sender
FAILED	⚠ (red)	Delivery timeout after 30 days	Move to dead letter queue, alert sender
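Because each tick is a separate application-layer acknowledgment, receipts can arrive late, twice, or out of order. One way to keep the status consistent, sketched here under the assumption that status only ever moves forward, is to order the states and keep the highest one seen. FAILED is left out because it is a terminal state reached by timeout rather than by a receipt.

```python
from enum import IntEnum


class Status(IntEnum):
    SENT = 1        # one grey tick
    DELIVERED = 2   # two grey ticks
    READ = 3        # two blue ticks


def apply_receipt(current, incoming):
    """Return the new status; stale or duplicate receipts never move it backwards."""
    return max(current, incoming)


status = Status.SENT
status = apply_receipt(status, Status.READ)       # read receipt arrives first
status = apply_receipt(status, Status.DELIVERED)  # late delivery ACK is ignored
```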

message_delivery.py

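# WhatsApp-style message delivery.
# Assumes `db` is an asyncpg-style connection and `redis` an asyncio Redis
# client; push_to_connection and send_push_notification are app-level helpers.
async def send_message(sender_id, recipient_id, content, msg_id):
    # 1. Store in DB (source of truth). RETURNING yields a row only when the
    #    insert actually happened, so a retried msg_id comes back as None.
    inserted = await db.fetchval("""
        INSERT INTO messages (id, sender_id, recipient_id, content, status)
        VALUES ($1, $2, $3, $4, 'SENT')
        ON CONFLICT (id) DO NOTHING  -- idempotency
        RETURNING id
    """, msg_id, sender_id, recipient_id, content)
    if inserted is None:
        return  # duplicate retry: already accepted, do not deliver again

    # 2. Check if recipient is online
    conn = await redis.get(f"conn:{recipient_id}")

    if conn:
        # Online: push directly via WebSocket. Status stays SENT here; it
        # becomes DELIVERED only once the recipient device ACKs.
        await push_to_connection(conn, {"type": "new_message", "msg": msg_id})
    else:
        # Offline: append to the tail so the queue drains in FIFO order
        await redis.rpush(f"offline_queue:{recipient_id}", msg_id)
        # FCM/APNs push notification
        await send_push_notification(recipient_id, "New message")


async def on_delivery_ack(msg_id):
    # Called when the recipient device acknowledges receipt
    await db.execute("UPDATE messages SET status='DELIVERED' WHERE id=$1", msg_id)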

✅

Idempotency via ON CONFLICT DO NOTHING

If the sender retries after no acknowledgment, the same message_id hits the INSERT again. The ON CONFLICT clause silently ignores the duplicate row: no error is thrown and no extra dedup state is needed. To avoid double delivery, the server must also skip the push when the insert affected no rows. The sender receives the same acknowledgment as for the first request.
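The retry behavior can be tried locally. This sketch uses an in-memory SQLite database as a stand-in for the production database (an assumption; SQLite 3.24 or later accepts the same ON CONFLICT ... DO NOTHING clause), and the store_message helper is hypothetical. Its return value tells the caller whether the message is new, which is the signal for deciding whether to deliver.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE messages (id TEXT PRIMARY KEY, content TEXT NOT NULL)")


def store_message(msg_id, content):
    """Insert once; return True only if this call created the row."""
    cur = conn.execute(
        "INSERT INTO messages (id, content) VALUES (?, ?) "
        "ON CONFLICT (id) DO NOTHING",
        (msg_id, content),
    )
    # rowcount is 1 for a fresh insert and 0 when the conflict clause fired.
    return cur.rowcount == 1


first = store_message("msg-1", "hello")   # original send
retry = store_message("msg-1", "hello")   # client retry with the same ID
row_count = conn.execute("SELECT COUNT(*) FROM messages").fetchone()[0]
```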

Group Messaging

The Fan-Out Problem

Groups multiply every message by the number of recipients. At scale this creates an enormous amplification problem that requires deliberate architectural design.

⚠️

The Fan-Out Math

Group with 1,000 members: each message → 1,000 delivery operations. In the worst case, where all 58K messages/sec go to groups of that size, that is 58K × 1K = 58 million delivery operations/second. A naive synchronous broadcast would overwhelm any server cluster.
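The amplification is easy to reproduce. The sketch below computes the upper bound and adds a small helper for blended traffic; the helper and its group_share parameter are hypothetical, since the text does not say what fraction of messages go to groups.

```python
base_rate = 58_000      # messages/second (from the text)
group_size = 1_000      # recipients per group message (from the text)

# Upper bound: every message is sent to a 1,000-member group.
worst_case_ops = base_rate * group_size


def delivery_ops_per_second(base_rate, group_share, avg_group_size):
    """Delivery operations/second when `group_share` of messages go to groups.

    group_share is a hypothetical input between 0.0 and 1.0.
    """
    direct = base_rate * (1 - group_share)              # one delivery each
    grouped = base_rate * group_share * avg_group_size  # fanned out
    return direct + grouped
```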

📋

Store Once

Message stored exactly once in DB. Delivery events are separate records — one per recipient. The message content is never duplicated.

📨

Kafka Fan-Out

Group message → Kafka topic. Each recipient's delivery server consumes the event and handles delivery for its own connected users.

🗂️

Participant List

Group membership stored in a separate table. Fan-out service reads members, generates delivery events. Max 1,024 members enforced at write time.
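A minimal sketch of how store-once and the participant list fit together, assuming in-memory structures in place of database tables. The class and function names are hypothetical. Each delivery event carries only the message ID, so the content is stored a single time no matter how many recipients there are.

```python
from dataclasses import dataclass, field

MAX_GROUP_SIZE = 1_024  # limit quoted in the text


@dataclass
class Group:
    group_id: str
    members: set = field(default_factory=set)

    def add_member(self, user_id):
        # Enforce the size limit at write time, not at fan-out time.
        if user_id not in self.members and len(self.members) >= MAX_GROUP_SIZE:
            raise ValueError("group is full")
        self.members.add(user_id)


@dataclass(frozen=True)
class DeliveryEvent:
    msg_id: str        # reference to the single stored message
    recipient_id: str


def fan_out(group, sender_id, msg_id):
    """One delivery event per member except the sender."""
    return [
        DeliveryEvent(msg_id, member)
        for member in sorted(group.members)
        if member != sender_id
    ]


group = Group("g1")
for user in ("alice", "bob", "carol"):
    group.add_member(user)

events = fan_out(group, sender_id="alice", msg_id="msg-42")
```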

⚡

Async Delivery

Sender gets SENT acknowledgment immediately. Group fan-out happens asynchronously — individual delivered/read receipts trickle back over time.

Platform	Max Group Size	Architecture	Tradeoff
WhatsApp	1,024	Fan-out per message, Kafka delivery	Limited size, strong delivery guarantees
Telegram	200,000	Channel model, no E2E encryption in groups	Massive scale, weaker privacy
Signal	1,000	E2E encrypted groups, sender-side fan-out	Privacy-first, higher sender bandwidth

Technology Decisions

Why Each Technology

Every technology choice in this WhatsApp-style stack is driven by a specific constraint. Understanding the "why" is what interviewers are actually testing.

Transport

WebSocket

Server can push messages instantly without the client asking. Persistent connection avoids repeated TLS handshake overhead. Binary framing reduces message size vs. HTTP headers.

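Not HTTP long-polling: every delivered message or timeout ends the request, so the client must reissue a full HTTP request, headers included, just to keep listening. Short polling is worse, returning an empty response every N seconds when no messages arrive.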

Message Queue

Kafka

Durable, replayable log. If a delivery server crashes, it replays from its last offset. Handles 58K msg/sec with horizontal partition scaling. Consumers pull at their own pace.
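The replay property can be illustrated without a broker. This is an in-memory stand-in for a single log partition, not the Kafka client API; the Log and Consumer classes are hypothetical. The key idea is that reading does not delete records, so a restarted consumer resumes from its last committed offset.

```python
class Log:
    """Append-only log; reading never removes records."""

    def __init__(self):
        self._records = []

    def append(self, record):
        self._records.append(record)
        return len(self._records) - 1  # offset of the new record

    def read_from(self, offset):
        return self._records[offset:]


class Consumer:
    def __init__(self, log, committed_offset=0):
        self.log = log
        self.committed_offset = committed_offset

    def poll(self):
        # Everything not yet committed, oldest first.
        return self.log.read_from(self.committed_offset)

    def commit(self, count):
        # Mark `count` more records as fully processed.
        self.committed_offset += count


log = Log()
for msg_id in ("m1", "m2", "m3"):
    log.append(msg_id)

consumer = Consumer(log)
batch = consumer.poll()
consumer.commit(1)                       # only m1 was delivered before the crash
saved_offset = consumer.committed_offset

restarted = Consumer(log, committed_offset=saved_offset)
replayed = restarted.poll()              # undelivered records are still there
```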

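Not RabbitMQ: a classic queue removes a message once it is acknowledged, so it cannot be replayed afterwards. If a consumer acks and then crashes before delivering, that delivery is lost with no recovery path.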

Presence

Redis TTL Keys

online:{user_id} key with 30-second TTL. Each heartbeat (ping frame) refreshes the TTL. If the key expires, the user is treated as offline. Reads are O(1) and typically return in under a millisecond.
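The TTL-key pattern can be sketched with an in-memory stand-in so it runs without a Redis server. The PresenceStore class is hypothetical, and the injectable clock is there only to make expiry easy to demonstrate; with Redis, the expiry would be handled by the key's TTL instead.

```python
import time


class PresenceStore:
    """In-memory stand-in for a key with a TTL that heartbeats refresh."""

    def __init__(self, ttl_seconds=30, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._expires_at = {}

    def heartbeat(self, user_id):
        # Each heartbeat pushes the expiry a full TTL into the future.
        self._expires_at[user_id] = self._clock() + self._ttl

    def is_online(self, user_id):
        expires = self._expires_at.get(user_id)
        return expires is not None and self._clock() < expires


# Fake clock so expiry can be shown without sleeping.
now = [0.0]
presence = PresenceStore(ttl_seconds=30, clock=lambda: now[0])

presence.heartbeat("alice")          # t=0, expires at t=30
now[0] = 29.0
online_before_expiry = presence.is_online("alice")
now[0] = 31.0                        # no heartbeat arrived in time
online_after_expiry = presence.is_online("alice")
```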

Not PostgreSQL — querying DB for presence status on every message delivery is too slow at 58K msg/sec.

Encryption

Signal Protocol

Double Ratchet provides forward secrecy — compromise of one key doesn't expose past messages. Pre-keys enable asynchronous key exchange for offline recipients. Battle-tested by cryptographers.
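The forward-secrecy idea can be illustrated with the symmetric half of the ratchet alone. This is a teaching sketch, not the Signal protocol: it omits the Diffie-Hellman ratchet, pre-keys, and encryption entirely, and the starting secret is a made-up demo value. The HMAC-SHA256 construction with the constants 0x01 and 0x02 follows the example suggested in the Double Ratchet specification. Because each step is a one-way function, someone who learns the current chain key cannot work backwards to earlier message keys.

```python
import hashlib
import hmac


def ratchet_step(chain_key: bytes):
    """Derive a one-time message key and the next chain key from chain_key."""
    message_key = hmac.new(chain_key, b"\x01", hashlib.sha256).digest()
    next_chain_key = hmac.new(chain_key, b"\x02", hashlib.sha256).digest()
    return message_key, next_chain_key


# Demo-only starting secret; a real session derives this from a key agreement.
chain_key = hashlib.sha256(b"demo shared secret").digest()

message_keys = []
for _ in range(3):
    message_key, chain_key = ratchet_step(chain_key)  # old chain key is discarded
    message_keys.append(message_key)
```

Each message gets its own key, and the chain key that produced it is overwritten immediately, which is what limits the damage from a later key compromise.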

Not custom crypto — cryptographic protocols are notoriously hard to implement correctly. Custom = vulnerabilities.

Knowledge Check

Quiz — 5 Questions

1. WhatsApp uses WebSocket instead of HTTP polling because:

2. A message has message_id=uuid. User resends after no ack. Server receives it twice. The idempotency check:

3. User is offline. 50 messages arrive. When they reconnect, the offline queue ensures:

4. End-to-end encryption means:

5. A group has 1,000 members. One member sends a message. Server operations needed for delivery: