WebSocket connection management, exactly-once message delivery, offline queuing, group fan-out, and end-to-end encryption at 2B-user scale.
2B users, 5B messages/day = 58K messages/second. Every message must be delivered exactly once, even when recipient is offline. No duplicates, no drops, no data loss.
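The headline rate is simple arithmetic worth sanity-checking (note this is a daily average; peak traffic runs well above it):

```python
# Back-of-envelope check for the throughput figure above.
messages_per_day = 5_000_000_000
seconds_per_day = 24 * 60 * 60           # 86,400
avg_rate = messages_per_day / seconds_per_day
print(f"{avg_rate:,.0f} msgs/sec")       # ~57,870, rounded to 58K in the text
```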
WebSocket (not HTTP) → server pushes messages without polling. Each server holds ~100K long-lived connections simultaneously.
Message dedup via message_id prevents double delivery. ON CONFLICT DO NOTHING on insert ensures idempotent writes even across retries.
Messages queued server-side when recipient is offline, flushed on reconnect in FIFO order. Push notification (APNs/FCM) wakes the app.
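The queue-then-flush behavior hinges on ordering: Redis `LPUSH` adds to the head, so draining from the tail (`RPOP`) yields FIFO. A minimal in-memory sketch of that pairing (the Redis calls are simulated with a `deque`):

```python
from collections import deque

offline = deque()

def lpush(msg):
    # Redis LPUSH semantics: newest message goes to the head.
    offline.appendleft(msg)

def flush():
    # Popping from the tail drains oldest-first, i.e. FIFO on reconnect.
    out = []
    while offline:
        out.append(offline.pop())
    return out

lpush("m1"); lpush("m2"); lpush("m3")   # queued while recipient offline
order = flush()                          # ["m1", "m2", "m3"]
```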
Signal protocol → messages encrypted on device, server sees only ciphertext. Even WhatsApp itself cannot read your messages, court order or not.
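One building block of the Signal protocol is the symmetric key ratchet: each message gets a fresh key derived from a chain key, and the chain only moves forward. This toy sketch shows the idea with HMAC-SHA256 as the KDF; it is deliberately simplified (no Diffie-Hellman ratchet, no AEAD encryption, not the real Signal KDF):

```python
import hashlib, hmac

def ratchet(chain_key: bytes):
    # Derive a one-time message key and the next chain key from the current
    # chain key. Distinct constants keep the two derivations independent.
    message_key = hmac.new(chain_key, b"\x01", hashlib.sha256).digest()
    next_chain  = hmac.new(chain_key, b"\x02", hashlib.sha256).digest()
    return message_key, next_chain

k0 = b"\x00" * 32            # toy initial chain key (real one comes from X3DH)
mk1, k1 = ratchet(k0)
mk2, k2 = ratchet(k1)
# Every message key is distinct, and the chain cannot be run backwards:
# compromising k2 reveals nothing about mk1 (forward secrecy).
```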
HTTP is request-response: client asks, server answers. For messaging you need server-initiated push. WebSocket upgrades an HTTP connection to a full-duplex TCP channel, so the server can push a message at any time without the client polling. At 58K msgs/sec, polling would generate massive unnecessary traffic on the 99.9% of requests that return nothing.
A WebSocket connection cycles through a small set of states: connecting → open → dropped → reconnecting. After a network drop, clients reconnect with exponential backoff so that a mass disconnect does not stampede the servers.
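Exponential backoff with jitter is the standard reconnect policy; the exact parameters below (1s base, 60s cap, full jitter) are illustrative assumptions, not WhatsApp's published values:

```python
import random

def backoff_delays(attempts, base=1.0, cap=60.0):
    # Exponential backoff with full jitter: the ceiling doubles each
    # attempt (1s, 2s, 4s, ...) and is capped; the actual delay is a
    # random draw below the ceiling so clients don't reconnect in sync.
    delays = []
    for attempt in range(attempts):
        ceiling = min(cap, base * (2 ** attempt))
        delays.append(random.uniform(0, ceiling))
    return delays

delays = backoff_delays(8)   # 8 reconnect attempts
```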
Every WhatsApp message travels through a precise pipeline. Each tick (checkmark) represents a discrete application-layer acknowledgment, not a TCP guarantee.
| State | Icon | When | Server Action |
|---|---|---|---|
| SENT | ✓ (grey) | Server acknowledged receipt from sender device | Store in DB, forward to recipient server |
| DELIVERED | ✓✓ (grey) | Recipient device ACKs receipt to server | Update delivered_at timestamp in DB |
| READ | ✓✓ (blue) | Recipient opens chat, app sends read receipt | Update read_at, push receipt to sender |
| FAILED | ✗ (red) | Delivery timeout after 30 days | Move to dead-letter queue, alert sender |
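The table above is really a state machine with a fixed set of legal transitions; a minimal sketch (the `advance` helper and its transition map are hypothetical, matching the states in the table):

```python
# Legal next-states for each message status.
VALID = {
    "SENT":      {"DELIVERED", "FAILED"},
    "DELIVERED": {"READ"},
    "READ":      set(),   # terminal
    "FAILED":    set(),   # terminal
}

def advance(state, event):
    # Reject out-of-order updates, e.g. READ before DELIVERED.
    if event not in VALID[state]:
        raise ValueError(f"illegal transition {state} -> {event}")
    return event
```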
```python
# WhatsApp-style message delivery
async def send_message(sender_id, recipient_id, content, msg_id):
    # 1. Store in DB (source of truth)
    await db.execute("""
        INSERT INTO messages (id, sender_id, recipient_id, content, status)
        VALUES ($1, $2, $3, $4, 'SENT')
        ON CONFLICT (id) DO NOTHING  -- idempotency
    """, msg_id, sender_id, recipient_id, content)

    # 2. Check if recipient is online
    conn = await redis.get(f"conn:{recipient_id}")
    if conn:
        # Online: push directly via WebSocket
        await push_to_connection(conn, {"type": "new_message", "msg": msg_id})
        await db.execute("UPDATE messages SET status='DELIVERED' WHERE id=$1", msg_id)
    else:
        # Offline: queue for later delivery
        await redis.lpush(f"offline_queue:{recipient_id}", msg_id)
        # FCM/APNs push notification
        await send_push_notification(recipient_id, "New message")
```
If the sender retries after no acknowledgment, the same message_id will hit the INSERT again. The ON CONFLICT clause silently ignores the duplicate โ no double delivery, no error thrown, no additional state needed. The response to the retry is identical to the first request.
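The retry behavior can be demonstrated end to end with SQLite, whose `INSERT OR IGNORE` plays the same role as Postgres's `ON CONFLICT DO NOTHING` (the table and helper here are illustrative, not the production schema):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE messages (id TEXT PRIMARY KEY, content TEXT)")

def store(msg_id, content):
    # OR IGNORE silently skips a duplicate primary key: the retry is a no-op.
    conn.execute("INSERT OR IGNORE INTO messages VALUES (?, ?)", (msg_id, content))

store("m1", "hello")    # first attempt
store("m1", "hello")    # client retry with the same message_id
count = conn.execute("SELECT COUNT(*) FROM messages").fetchone()[0]   # 1
```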
Groups multiply every message by the number of recipients. At scale this creates an enormous amplification problem that requires deliberate architectural design.
Group with 1,000 members: each message → 1,000 delivery operations. If even the 58K messages/sec base rate all went to 1,000-member groups, that's 58K × 1K = 58 million delivery operations/second. A naive broadcast approach would melt any server cluster immediately.
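The amplification arithmetic, spelled out:

```python
# Worst-case fan-out amplification for large groups.
base_rate = 58_000            # messages/sec across the platform
group_size = 1_000            # recipients per group message
delivery_ops = base_rate * group_size
print(f"{delivery_ops:,} delivery ops/sec")   # 58,000,000
```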
Message stored exactly once in DB. Delivery events are separate records โ one per recipient. The message content is never duplicated.
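A sketch of that split, using SQLite for brevity (table and column names are assumptions, not WhatsApp's actual schema): content lives in one row, per-recipient delivery state in another table.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    -- One row per message: content is never duplicated.
    CREATE TABLE messages (id TEXT PRIMARY KEY, sender TEXT, content TEXT);
    -- One row per (message, recipient): delivery state only.
    CREATE TABLE deliveries (
        message_id TEXT REFERENCES messages(id),
        recipient  TEXT,
        status     TEXT DEFAULT 'SENT',
        PRIMARY KEY (message_id, recipient)
    );
""")
db.execute("INSERT INTO messages VALUES ('m1', 'alice', 'hi group')")
members = ["bob", "carol", "dave"]
db.executemany("INSERT INTO deliveries (message_id, recipient) VALUES ('m1', ?)",
               [(m,) for m in members])
msg_rows = db.execute("SELECT COUNT(*) FROM messages").fetchone()[0]        # 1
delivery_rows = db.execute("SELECT COUNT(*) FROM deliveries").fetchone()[0] # 3
```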
Group message → Kafka topic. Each recipient's delivery server consumes the event and handles delivery for its own connected users.
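The routing step can be simulated in memory: a stable hash maps each user to a home delivery server, and one group message becomes per-server batches, mimicking Kafka consumers each handling their own partition (server names and the `fan_out` helper are hypothetical):

```python
import hashlib

SERVERS = ["delivery-1", "delivery-2", "delivery-3"]   # assumed cluster

def home_server(user_id: str) -> str:
    # Stable hash: a user's events always route to the same consumer.
    h = int(hashlib.sha256(user_id.encode()).hexdigest(), 16)
    return SERVERS[h % len(SERVERS)]

def fan_out(members):
    # One group message -> delivery batches grouped by owning server.
    batches = {}
    for m in members:
        batches.setdefault(home_server(m), []).append(m)
    return batches

batches = fan_out([f"user{i}" for i in range(9)])
```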
Group membership stored in a separate table. Fan-out service reads members, generates delivery events. Max 1,024 members enforced at write time.
Sender gets SENT acknowledgment immediately. Group fan-out happens asynchronously; individual delivered/read receipts trickle back over time.
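That decoupling is easy to show with asyncio: the sender's ack is produced before any delivery completes, and receipts arrive afterward (the coroutine names here are illustrative):

```python
import asyncio

async def deliver(recipient_id, msg_id, receipts):
    # Stand-in for a network round trip to one recipient.
    await asyncio.sleep(0)
    receipts.append((recipient_id, "DELIVERED"))

async def send_group_message(sender, members, msg_id):
    receipts = []
    # Schedule fan-out in the background; do NOT await it before acking.
    fanout = asyncio.gather(*(deliver(m, msg_id, receipts) for m in members))
    ack = "SENT"        # sender sees the single grey tick immediately
    await fanout        # receipts trickle in asynchronously afterward
    return ack, receipts

ack, receipts = asyncio.run(send_group_message("alice", ["bob", "carol"], "m1"))
```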
| Platform | Max Group Size | Architecture | Tradeoff |
|---|---|---|---|
| WhatsApp | 1,024 | Fan-out per message, Kafka delivery | Limited size, strong delivery guarantees |
| Telegram | 200,000 | Channel model, no E2E encryption in groups | Massive scale, weaker privacy |
| Signal | 1,000 | E2E encrypted groups, sender-side fan-out | Privacy-first, higher sender bandwidth |
Every technology choice in WhatsApp's stack was driven by specific constraints. Understanding the "why" is what interviewers are actually testing.