Day 25 · Week 4

Design WhatsApp

WebSocket connection management, exactly-once message delivery, offline queuing, group fan-out, and end-to-end encryption at 2B-user scale.

5B Messages/Day · 3.5h Study Time · 2 Simulations · 5 Quizzes
WebSocket · End-to-End Encryption · Offline Queuing · Group Messaging · Presence

58,000 Messages Per Second

With 2B users sending 5B messages/day, the average load is about 58K messages/second (5B ÷ 86,400 seconds). Every message must be delivered exactly once, even when the recipient is offline: no duplicates, no drops, no data loss.
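
The 58K figure is simple arithmetic, sketched below. It is a daily average only; the source gives no peak-to-average ratio, so none is assumed here.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def average_rate(messages_per_day: int) -> float:
    """Average messages per second, assuming traffic is spread evenly over a day."""
    return messages_per_day / SECONDS_PER_DAY


rate = average_rate(5_000_000_000)
print(f"{rate:,.0f} msg/s")  # 57,870 msg/s, rounded to 58K in the text
```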

Persistent Connections

WebSocket (not HTTP): the server pushes messages without polling. Each server holds ~100K long-lived connections simultaneously.

Exactly-Once Delivery

Message dedup via message_id prevents double delivery. ON CONFLICT DO NOTHING on insert ensures idempotent writes even across retries.

Offline Queue

Messages queued server-side when recipient is offline, flushed on reconnect in FIFO order. Push notification (APNs/FCM) wakes the app.

E2E Encryption

Signal protocol: messages are encrypted on the device, so the server sees only ciphertext. Even under a court order, WhatsApp cannot read message content.

Why WebSocket, Not HTTP?

HTTP is request-response: the client asks, the server answers. Messaging needs server-initiated push. WebSocket upgrades an HTTP connection to a full-duplex channel over the same TCP connection, so the server can push a message at any time without the client polling. At 58K msgs/sec, polling would add massive unnecessary traffic from the roughly 99.9% of requests that return nothing.

WebSocket State Machine

The client moves through four WebSocket connection states. The system handles network drops by reconnecting with exponential backoff.

DISCONNECTED → CONNECTING → CONNECTED ⇄ RECONNECTING
Initial state: DISCONNECTED. There is no active WebSocket; the client attempts a connection on the next action.
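
The state diagram and the backoff behavior can be sketched as a small state machine with no real networking. The failure and close transitions back to DISCONNECTED, the 1-second base delay, the 30-second cap, and the use of full jitter are assumptions for illustration; the source names only the four states and "exponential backoff".

```python
import random
from enum import Enum


class State(Enum):
    DISCONNECTED = "DISCONNECTED"
    CONNECTING = "CONNECTING"
    CONNECTED = "CONNECTED"
    RECONNECTING = "RECONNECTING"


# Allowed transitions. The arrows back to DISCONNECTED are assumed, not from the source.
TRANSITIONS = {
    State.DISCONNECTED: {State.CONNECTING},
    State.CONNECTING: {State.CONNECTED, State.DISCONNECTED},
    State.CONNECTED: {State.RECONNECTING, State.DISCONNECTED},
    State.RECONNECTING: {State.CONNECTED, State.DISCONNECTED},
}


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 30.0) -> float:
    """Delay before retry number `attempt` (0-based): doubles each time, up to `cap`."""
    return min(cap, base * (2 ** attempt))


def jittered(delay: float) -> float:
    """Full jitter: a random wait in [0, delay], so clients do not reconnect in lockstep."""
    return random.uniform(0, delay)


class ConnectionStateMachine:
    def __init__(self) -> None:
        self.state = State.DISCONNECTED
        self.attempt = 0  # consecutive reconnect attempts since the last success

    def transition(self, new_state: State) -> None:
        if new_state not in TRANSITIONS[self.state]:
            raise ValueError(f"illegal transition {self.state.name} -> {new_state.name}")
        self.state = new_state
        if new_state is State.CONNECTED:
            self.attempt = 0  # a successful connection resets the backoff

    def on_network_drop(self) -> None:
        self.transition(State.RECONNECTING)

    def next_retry_delay(self) -> float:
        """Un-jittered wait before the next reconnect attempt."""
        delay = backoff_delay(self.attempt)
        self.attempt += 1
        return delay


sm = ConnectionStateMachine()
sm.transition(State.CONNECTING)
sm.transition(State.CONNECTED)
sm.on_network_drop()
delays = [sm.next_retry_delay() for _ in range(6)]
print(delays)  # [1.0, 2.0, 4.0, 8.0, 16.0, 30.0]
sm.transition(State.CONNECTED)  # reconnect succeeded, backoff resets
```

A real client would sleep for `jittered(delay)` between attempts rather than the raw delay.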

Offline Message Queue

Messages arriving while you are DISCONNECTED accumulate server-side. On reconnect they flush in order.
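
A minimal in-memory sketch of that queue follows. The `OfflineQueue` class and its method names are hypothetical stand-ins for per-recipient Redis lists. One assumption beyond the text: an entry is removed only after the device acknowledges it, so a connection that drops mid-flush does not lose messages.

```python
from collections import defaultdict, deque


class OfflineQueue:
    """Per-recipient FIFO queue of message IDs (in-memory stand-in for Redis lists)."""

    def __init__(self):
        self._queues = defaultdict(deque)

    def enqueue(self, recipient_id, msg_id):
        # Append at the tail so the oldest message stays at the head.
        self._queues[recipient_id].append(msg_id)

    def pending(self, recipient_id):
        """Message IDs to push on reconnect, oldest first. Nothing is removed yet."""
        return list(self._queues[recipient_id])

    def ack(self, recipient_id, msg_id):
        """Drop the head entry once the device confirms it; ignore out-of-order ACKs."""
        queue = self._queues[recipient_id]
        if queue and queue[0] == msg_id:
            queue.popleft()
            return True
        return False


q = OfflineQueue()
for i in range(1, 4):
    q.enqueue("bob", f"m{i}")
print(q.pending("bob"))  # ['m1', 'm2', 'm3']
q.ack("bob", "m1")
print(q.pending("bob"))  # ['m2', 'm3']
```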

From Sender to Blue Ticks

Every WhatsApp message travels through a precise pipeline. Each tick (checkmark) represents a discrete application-layer acknowledgment, not a TCP guarantee.

Sender Device (encrypts msg)
  → WebSocket → Chat Server (stores ciphertext)
  → Redis check → Dedup Layer (msg_id unique?)
  → Recipient Server (looks up connection)
  → WebSocket push → Recipient Device (decrypts locally)

| State | Icon | When | Server Action |
|---|---|---|---|
| SENT | ✓ (grey) | Sender device → server acknowledged | Store in DB, forward to recipient server |
| DELIVERED | ✓✓ (grey) | Server → recipient device ACK received | Update delivered_at timestamp in DB |
| READ | ✓✓ (blue) | Recipient opens chat, app sends read receipt | Update read_at, push receipt to sender |
| FAILED | ⚠ (red) | Delivery timeout after 30 days | Move to dead letter queue, alert sender |
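
Because each tick is an application-layer receipt, receipts can arrive late, twice, or out of order. One way to keep the status consistent is to treat it as monotonic, as in this sketch. The `Status` enum and `apply_receipt` function are illustrative names, and FAILED is left out for brevity.

```python
from enum import IntEnum


class Status(IntEnum):
    SENT = 1       # one grey tick
    DELIVERED = 2  # two grey ticks
    READ = 3       # two blue ticks


def apply_receipt(current: Status, incoming: Status) -> Status:
    """Never move a message backwards: a late DELIVERED cannot undo READ."""
    return max(current, incoming)


status = Status.SENT
for receipt in (Status.READ, Status.DELIVERED, Status.READ):
    status = apply_receipt(status, receipt)
print(status.name)  # READ
```
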
message_delivery.py
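# WhatsApp-style message delivery (sketch).
# Assumed to exist elsewhere: `db` (asyncpg-style pool), `redis` (redis.asyncio client),
# push_to_connection() and send_push_notification().
async def send_message(sender_id, recipient_id, content, msg_id):
    # 1. Store in DB (source of truth). Under E2E encryption `content` is ciphertext.
    #    RETURNING yields a row only when this call actually inserted it.
    inserted = await db.fetchval("""
        INSERT INTO messages (id, sender_id, recipient_id, content, status)
        VALUES ($1, $2, $3, $4, 'SENT')
        ON CONFLICT (id) DO NOTHING  -- idempotency
        RETURNING id
    """, msg_id, sender_id, recipient_id, content)
    if inserted is None:
        # Retry of a message that is already stored: acknowledge, do not re-deliver
        return

    # 2. Check if recipient is online
    conn = await redis.get(f"conn:{recipient_id}")

    if conn:
        # Online: push directly via WebSocket. Status stays 'SENT' until the
        # recipient device ACKs and mark_delivered() runs.
        await push_to_connection(conn, {"type": "new_message", "msg": msg_id})
    else:
        # Offline: append to the tail so the queue drains in FIFO order
        await redis.rpush(f"offline_queue:{recipient_id}", msg_id)
        # FCM/APNs push notification wakes the app
        await send_push_notification(recipient_id, "New message")


async def mark_delivered(msg_id):
    # Called when the recipient device acknowledges receipt (second grey tick)
    await db.execute(
        "UPDATE messages SET status='DELIVERED', delivered_at=now() "
        "WHERE id=$1 AND status='SENT'",
        msg_id,
    )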

Idempotency via ON CONFLICT DO NOTHING

If the sender retries after receiving no acknowledgment, the same message_id hits the INSERT again. The ON CONFLICT clause silently skips the duplicate row: no second copy is stored, no error is thrown, and no extra dedup state is needed. The server responds to the retry exactly as it did to the first request.
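
The behavior can be checked locally with SQLite from the Python standard library, which accepts the same ON CONFLICT ... DO NOTHING clause (SQLite 3.24 or newer). This is a stand-in for the PostgreSQL-style statement in the lesson, with a reduced, hypothetical two-column schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE messages (id TEXT PRIMARY KEY, content TEXT NOT NULL)")


def store_once(msg_id: str, content: str) -> bool:
    """Insert the message; return True only if this call created the row."""
    cur = conn.execute(
        "INSERT INTO messages (id, content) VALUES (?, ?) "
        "ON CONFLICT (id) DO NOTHING",
        (msg_id, content),
    )
    conn.commit()
    return cur.rowcount == 1  # 0 rows changed means the ID already existed


first = store_once("msg-1", "hello")
retry = store_once("msg-1", "hello")  # the sender retries with the same ID
rows = conn.execute("SELECT COUNT(*) FROM messages").fetchone()[0]
print(first, retry, rows)  # True False 1
```

The boolean result lets the caller tell a first write from a retry and skip the delivery step for the retry.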

The Fan-Out Problem

Groups multiply every message by the number of recipients. At scale this creates an enormous amplification problem that requires deliberate architectural design.

The Fan-Out Math

A group with 1,000 members turns each message into 1,000 delivery operations. In the worst case, where every message at the 58K messages/sec base rate goes to a 1,000-member group, that is 58K × 1K = 58 million delivery operations/second. A naive synchronous broadcast cannot absorb that amplification.
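
The worst-case bound is a single multiplication, shown here so the inputs can be varied. Real traffic mixes one-to-one and group messages, so the actual rate would be lower; the source gives no such mix, and none is assumed.

```python
def fanout_ops_per_second(messages_per_second: int, recipients_per_message: int) -> int:
    """Delivery operations per second if every message goes to this many recipients."""
    return messages_per_second * recipients_per_message


worst_case = fanout_ops_per_second(58_000, 1_000)
print(f"{worst_case:,}")  # 58,000,000
```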

Store Once

Message stored exactly once in the DB. Delivery events are separate records, one per recipient. The message content is never duplicated.

Kafka Fan-Out

Group message → Kafka topic. Each recipient's delivery server consumes the event and handles delivery for its own connected users.

Participant List

Group membership stored in a separate table. Fan-out service reads members, generates delivery events. Max 1,024 members enforced at write time.

Async Delivery

Sender gets SENT acknowledgment immediately. Group fan-out happens asynchronously, and individual delivered/read receipts trickle back over time.
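
The store-once and per-recipient delivery pattern above can be sketched in memory. The `GroupFanout` class, the `DeliveryEvent` record, and the plain list standing in for the Kafka topic are all hypothetical; only the 1,024-member cap and the store-once rule come from the text.

```python
from dataclasses import dataclass

MAX_GROUP_SIZE = 1024  # limit stated in the text, enforced when a member is added


@dataclass(frozen=True)
class DeliveryEvent:
    msg_id: str
    recipient_id: str
    status: str = "PENDING"


class GroupFanout:
    def __init__(self):
        self.messages = {}    # msg_id -> ciphertext, stored exactly once
        self.members = {}     # group_id -> list of user IDs
        self.deliveries = []  # stand-in for the Kafka topic: one event per recipient

    def add_member(self, group_id, user_id):
        members = self.members.setdefault(group_id, [])
        if len(members) >= MAX_GROUP_SIZE:
            raise ValueError("group is full")
        if user_id not in members:
            members.append(user_id)

    def send(self, group_id, sender_id, msg_id, ciphertext):
        if msg_id in self.messages:  # retry of an already stored message
            return []
        self.messages[msg_id] = ciphertext
        events = [
            DeliveryEvent(msg_id, uid)
            for uid in self.members.get(group_id, [])
            if uid != sender_id
        ]
        self.deliveries.extend(events)  # consumed asynchronously by delivery servers
        return events


fanout = GroupFanout()
for uid in ("alice", "bob", "carol"):
    fanout.add_member("g1", uid)
events = fanout.send("g1", "alice", "m1", b"<ciphertext>")
print(len(fanout.messages), [e.recipient_id for e in events])  # 1 ['bob', 'carol']
```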

| Platform | Max Group Size | Architecture | Tradeoff |
|---|---|---|---|
| WhatsApp | 1,024 | Fan-out per message, Kafka delivery | Limited size, strong delivery guarantees |
| Telegram | 200,000 | Channel model, no E2E encryption in groups | Massive scale, weaker privacy |
| Signal | 1,000 | E2E encrypted groups, sender-side fan-out | Privacy-first, higher sender bandwidth |

Why Each Technology

Every technology choice in this WhatsApp-style design is driven by a specific constraint. Understanding the "why" is what interviewers are actually testing.

Transport
WebSocket
The server can push messages instantly without the client asking. A persistent connection avoids repeated TCP and TLS handshakes. Each frame carries only a few bytes of header, far less than the HTTP headers sent with every request.
Not HTTP long-polling: it wastes bandwidth on empty responses every N seconds when no messages arrive.
Message Queue
Kafka
Durable, replayable log. If a delivery server crashes, it replays from its last committed offset. Handles 58K msg/sec with horizontal partition scaling. Consumers pull at their own pace.
Not RabbitMQ: a classic queue drops a message once it is consumed and acknowledged, so there is no replay. A lost message means a lost delivery with no recovery path.
Presence
Redis TTL Keys
online:{user_id} key with a 30-second TTL. Each heartbeat (ping frame) refreshes the TTL. If the key expires, the user is treated as offline. Reads are O(1) and typically sub-millisecond.
Not PostgreSQL: querying the DB for presence status on every message delivery is too slow at 58K msg/sec.
Encryption
Signal Protocol
Double Ratchet provides forward secrecy: compromise of one key doesn't expose past messages. Pre-keys enable asynchronous key exchange for offline recipients. Battle-tested by cryptographers.
Not custom crypto: cryptographic protocols are notoriously hard to implement correctly, and custom implementations tend to introduce vulnerabilities.
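
The Redis TTL presence scheme described above can be illustrated without a Redis server. The `PresenceStore` class below is a hypothetical in-memory stand-in for setting `online:{user_id}` with a 30-second expiry (in Redis, `SET key value EX 30`). The injectable clock is an assumption added so the expiry can be shown without waiting.

```python
import time

PRESENCE_TTL_SECONDS = 30  # TTL from the text


class PresenceStore:
    """In-memory stand-in for Redis keys that expire after a TTL."""

    def __init__(self, clock=time.monotonic):
        self._expires_at = {}
        self._clock = clock

    def heartbeat(self, user_id):
        # Each ping refreshes the expiry, like re-setting the key with EX 30.
        self._expires_at[f"online:{user_id}"] = self._clock() + PRESENCE_TTL_SECONDS

    def is_online(self, user_id):
        expires_at = self._expires_at.get(f"online:{user_id}")
        return expires_at is not None and self._clock() < expires_at


now = [0.0]  # fake clock so the example is deterministic
store = PresenceStore(clock=lambda: now[0])
store.heartbeat("alice")
now[0] = 29.0
print(store.is_online("alice"))  # True
now[0] = 31.0
print(store.is_online("alice"))  # False, no heartbeat within 30 seconds
```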

Quiz: 5 Questions

1. WhatsApp uses WebSocket instead of HTTP polling because:
2. A message has message_id=uuid. User resends after no ack. Server receives it twice. The idempotency check:
3. User is offline. 50 messages arrive. When they reconnect, the offline queue ensures:
4. End-to-end encryption means:
5. A group has 1,000 members. One member sends a message. Server operations needed for delivery: