Day 28 · Week 4

Design Slack

Real-time channel fan-out, persistent WebSocket connections at scale, message threading, workspace-scoped search, and file sharing with S3 pre-signed URLs.

12M DAU · ~4h Study Time · 2 Simulations · 5 Quizzes
Channel Fan-out · Real-time Messaging · Message Threading · Search · File Sharing

Core Design Principles

Before diving into the simulation, understand the four architectural pillars that make Slack's real-time messaging work at 12M DAU scale.

Channel Architecture

Messages stored per channel; fan-out to all online members on send. Channel membership cached in Redis as a set: O(1) lookup for online/offline routing decisions.

WebSocket Connections

Each Slack client holds a persistent WebSocket. ~12M concurrent connections distributed across gateway servers. Each gateway tracks its active connections in local memory + Redis for routing.
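
The two-level lookup described above (a local connection table per gateway, plus a shared directory for routing) can be sketched as follows. This is a minimal in-memory illustration, not Slack's implementation: a plain dict stands in for Redis, a list stands in for each socket, and the `Gateway` class and `route` function are hypothetical names.

```python
from typing import Dict, List


class Gateway:
    """One gateway server: local connection table plus a shared directory."""

    def __init__(self, name: str, directory: Dict[str, str]):
        self.name = name
        self.directory = directory               # stand-in for Redis: user_id -> gateway name
        self.local: Dict[str, List[dict]] = {}   # user_id -> outbox (stand-in for a socket)

    def connect(self, user_id: str) -> None:
        self.local[user_id] = []
        self.directory[user_id] = self.name

    def disconnect(self, user_id: str) -> None:
        self.local.pop(user_id, None)
        # Only clear the directory entry if it still points at this gateway;
        # the user may already have reconnected elsewhere.
        if self.directory.get(user_id) == self.name:
            del self.directory[user_id]

    def deliver(self, user_id: str, event: dict) -> bool:
        outbox = self.local.get(user_id)
        if outbox is None:
            return False
        outbox.append(event)
        return True


def route(gateways: Dict[str, Gateway], directory: Dict[str, str],
          user_id: str, event: dict) -> bool:
    """Look up the user's gateway in the shared directory and deliver there."""
    name = directory.get(user_id)
    if name is None:
        return False  # offline: the caller would fall back to a push notification
    return gateways[name].deliver(user_id, event)


directory: Dict[str, str] = {}
gateways = {name: Gateway(name, directory) for name in ("gw-1", "gw-2")}
gateways["gw-1"].connect("alice")
gateways["gw-2"].connect("bob")

delivered = route(gateways, directory, "bob", {"type": "message", "text": "hi"})
missed = route(gateways, directory, "carol", {"type": "message", "text": "hi"})
print(delivered, missed)
```

The shared directory is what lets any service find the right gateway without broadcasting to all of them; the local table is what the gateway itself uses to reach the socket.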

Message Threading

Parent message + thread_ts creates a threaded view. Threads are stored flat in PostgreSQL: every reply carries a thread_ts pointing directly to the parent message, so fetching a thread is a single indexed lookup with no recursive queries.
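
A minimal sketch of the flat threading model described above, using an in-memory list in place of PostgreSQL rows. Only `thread_ts` comes from the text; the `Message` dataclass and `get_thread` helper are hypothetical names for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Message:
    ts: str                          # message ID in timestamp.sequence format
    text: str
    thread_ts: Optional[str] = None  # parent's ts if this is a reply, else None


def get_thread(messages: List[Message], parent_ts: str) -> List[Message]:
    """Return the parent followed by its replies in ID order.

    Every reply points directly at the parent, so one flat filter
    (one indexed query in SQL) is enough; no recursion is required.
    """
    parent = [m for m in messages if m.ts == parent_ts]
    replies = [m for m in messages if m.thread_ts == parent_ts]
    # Fixed-width IDs sort correctly as plain strings.
    return parent + sorted(replies, key=lambda m: m.ts)


messages = [
    Message("1630000000.000001", "Deploy at 5pm?"),
    Message("1630000005.000001", "Unrelated top-level message"),
    Message("1630000009.000001", "Works for me", thread_ts="1630000000.000001"),
    Message("1630000007.000001", "Can we do 6pm?", thread_ts="1630000000.000001"),
]
thread = get_thread(messages, "1630000000.000001")
print([m.text for m in thread])
```

In SQL the same lookup would be a single `WHERE thread_ts = $1 ORDER BY id` query, assuming an index on `thread_ts`.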

Search

Elasticsearch indexes all messages per workspace. Search is scoped to the workspace, not global. Each workspace has its own ES index, making full-text search tractable at scale.

Why WebSocket over HTTP Polling?

HTTP polling adds 1-5 seconds of latency (up to one polling interval) and wastes server resources on requests that mostly return nothing. WebSocket lets the server push to clients immediately, with no polling overhead. At 12M users, even a 1-second polling interval means 12M HTTP requests per second, which is impractical; a persistent WebSocket per client avoids that load.
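
The polling arithmetic above can be checked with a few lines. The 12M client count comes from the text; the figure of 60 incoming messages per user per hour is an assumed workload chosen only to illustrate how many polls come back empty.

```python
def polling_requests_per_second(clients: int, interval_s: float) -> float:
    """Requests per second generated when every client polls on a fixed interval."""
    return clients / interval_s


def empty_poll_fraction(messages_per_hour: float, interval_s: float) -> float:
    """Fraction of polls that return nothing, assuming at most one message per poll."""
    polls_per_hour = 3600 / interval_s
    return max(0.0, 1.0 - messages_per_hour / polls_per_hour)


CLIENTS = 12_000_000        # from the text: 12M users
MESSAGES_PER_HOUR = 60      # assumed workload, for illustration only

for interval in (1.0, 5.0):
    rps = polling_requests_per_second(CLIENTS, interval)
    wasted = empty_poll_fraction(MESSAGES_PER_HOUR, interval)
    print(f"interval={interval:.0f}s  requests/s={rps:,.0f}  empty polls={wasted:.1%}")
```

Lengthening the interval cuts the request rate but raises delivery latency by the same factor, which is the trade-off a server push avoids.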

Channel Fan-Out Visualizer

Click "Post in #general" to see a message fan out to all online members. Toggle member status to see how offline members are handled via push notification queues.

Slack Workspace โ€” Channel Fan-Out

Animated visualization of how a message propagates from sender to all channel members

350 Online Members · 150 Offline / Push Queue · 0 WS Pushes · Fan-out ms: n/a

Message Storage & Delivery Pipeline

1. Sender → WebSocket → Message Server
Client sends message over its persistent WebSocket to the assigned gateway server. Optimistic UI shows the message immediately. Gateway forwards to Message Service.

2. Persist to PostgreSQL (source of truth)
Message Service assigns a timestamp-based ID (1630000000.000001 format) and writes to PostgreSQL partitioned by channel_id. This ensures message ordering and durability before delivery.

3. Publish to Kafka topic (channel_id as partition key)
Fan-out service reads from Kafka. Using channel_id as partition key ensures all messages for a channel are processed in order by a single consumer. Kafka provides durability and replay capability.

4. Fan-out: online → WebSocket push, offline → push notification
Fan-out service reads channel membership from Redis. For online members, finds their gateway server via Redis and routes the message. For offline members, queues APNs/FCM push notifications.
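
Step 3 depends on one property: a given channel_id always maps to the same partition, so a single consumer sees that channel's messages in publish order. The sketch below illustrates this with an MD5-based hash as a stand-in; Kafka's default partitioner uses a different hash function, and `NUM_PARTITIONS`, `partition_for`, and `route` are hypothetical names.

```python
import hashlib
from collections import defaultdict
from typing import Dict, List, Tuple

NUM_PARTITIONS = 8  # assumed partition count for the illustration


def partition_for(channel_id: str, num_partitions: int = NUM_PARTITIONS) -> int:
    """Map a key to a partition with a stable hash (stand-in for Kafka's partitioner)."""
    digest = hashlib.md5(channel_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions


def route(events: List[Tuple[str, str]]) -> Dict[int, List[Tuple[str, str]]]:
    """Append each (channel_id, ts) event to the log of its partition."""
    partitions: Dict[int, List[Tuple[str, str]]] = defaultdict(list)
    for channel_id, ts in events:
        partitions[partition_for(channel_id)].append((channel_id, ts))
    return partitions


events = [
    ("C1", "1630000000.000001"),
    ("C2", "1630000000.000002"),
    ("C1", "1630000001.000001"),
    ("C1", "1630000002.000001"),
]
partitions = route(events)
c1_partition = partition_for("C1")
c1_order = [ts for ch, ts in partitions[c1_partition] if ch == "C1"]
print(c1_partition, c1_order)
```

Ordering is only guaranteed within a partition, so messages for different channels may interleave freely; that is acceptable here because clients only need per-channel order.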

Message Schema

| Field | Type | Notes |
| --- | --- | --- |
| id | varchar | 1630000000.000001 (timestamp.sequence format) |
| channel_id | varchar | Which channel this message belongs to |
| user_id | bigint | Who sent the message |
| text | text | Message content (may be empty if file-only) |
| thread_ts | varchar | Points to parent message id if this is a reply |
| files | jsonb[] | Array of S3 file references with metadata |
| reactions | jsonb | Map of emoji → [user_ids] for reaction counts |
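
The `id` column uses a timestamp.sequence format. One way such IDs could be generated is sketched below, as a single-process illustration only: the `MessageIdGenerator` class is hypothetical, and a real service would need to coordinate the sequence across writers (for example per channel) to guarantee uniqueness.

```python
import time
from typing import Callable


class MessageIdGenerator:
    """Produce IDs shaped like '<unix seconds>.<6-digit sequence>'."""

    def __init__(self, clock: Callable[[], float] = time.time):
        self._clock = clock
        self._last_second = -1
        self._seq = 0

    def next_id(self) -> str:
        second = int(self._clock())
        if second > self._last_second:
            # New second: restart the sequence.
            self._last_second = second
            self._seq = 0
        # If the clock moved backwards, keep the last second so IDs stay increasing.
        self._seq += 1
        # Note: the sketch does not handle more than 999,999 IDs in one second.
        return f"{self._last_second}.{self._seq:06d}"


# A fixed clock makes the output deterministic for the example.
gen = MessageIdGenerator(clock=lambda: 1630000000.5)
first, second = gen.next_id(), gen.next_id()
print(first, second)
```

Two messages created in the same second get distinct, ordered IDs, which is the role of the decimal part in the format shown in the table.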
channel_fanout.py
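# Slack-style channel fan-out (simplified: delivers inline instead of via Kafka)
import time

# Assumed to exist elsewhere in the service:
#   db     - async PostgreSQL client (asyncpg-style $1 placeholders)
#   redis  - async Redis client configured to return str, not bytes
#   publish_to_server(), queue_push_notification() - delivery helpers


async def send_channel_message(channel_id: str, sender_id: int, text: str):
    # Timestamp-based ID. time.time() alone can collide under load; a real
    # service would append a sequence number to guarantee uniqueness.
    msg_ts = f"{time.time():.6f}"

    # 1. Persist to database (source of truth) before any delivery
    await db.execute("""
        INSERT INTO messages (id, channel_id, user_id, text)
        VALUES ($1, $2, $3, $4)
    """, msg_ts, channel_id, sender_id, text)

    # 2. Get channel membership (cached in Redis as a set)
    member_ids = await redis.smembers(f"channel:{channel_id}:members")

    # 3. Fan-out to online members via WebSocket
    online_members = await redis.smembers(f"channel:{channel_id}:online")

    event = {"type": "message", "channel": channel_id,
             "user": sender_id, "ts": msg_ts, "text": text}

    for member_id in online_members:
        # Look up which gateway server holds this member's WebSocket
        conn_server = await redis.get(f"user:{member_id}:conn_server")
        if conn_server:
            await publish_to_server(conn_server, member_id, event)

    # 4. Push notifications for offline members (set difference)
    offline = member_ids - online_members
    for member_id in offline:
        await queue_push_notification(member_id, channel_id, text)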

Search Architecture

Slack Search Scope

Slack search is scoped to messages in your accessible channels only, not global across all workspaces. This makes it feasible: each workspace is isolated into its own Elasticsearch index. Cross-workspace search would require indexing all messages globally, multiplying the index size by millions of workspaces.
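
A scoped search request might be built along these lines. The body uses the standard Elasticsearch bool query (a `match` clause for full-text, a `terms` filter for the caller's channels); the `messages-<workspace>` index naming scheme and the `build_search_request` helper are assumptions for illustration, and no Elasticsearch client is called.

```python
from typing import List, Tuple


def build_search_request(workspace_id: str, text: str,
                         channel_ids: List[str], size: int = 20) -> Tuple[str, dict]:
    """Return (index name, query body) for a workspace-scoped message search."""
    # Elasticsearch index names must be lowercase.
    index = f"messages-{workspace_id.lower()}"
    body = {
        "size": size,
        "query": {
            "bool": {
                # Full-text match on the analyzed message text.
                "must": [{"match": {"text": text}}],
                # Restrict hits to channels the caller can access.
                "filter": [{"terms": {"channel_id": channel_ids}}],
            }
        },
    }
    return index, body


index, body = build_search_request("T024BE7LD", "deploy", ["C1", "C2"])
print(index)
print(body["query"]["bool"]["filter"])
```

Because the query targets one workspace's index, its cost scales with that workspace's message volume rather than with the whole platform's.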

1. Messages → Kafka topic
Every new message is published to Kafka alongside the fan-out path. A separate consumer group handles search indexing asynchronously, with no impact on message delivery latency.

2. Kafka → Elasticsearch indexer (per workspace index)
Each workspace has its own ES index. The indexer reads from Kafka and writes to the correct workspace index. Field mappings: text (analyzed), user_id, channel_id, ts (range queries), attachments.

3. Search API → Elasticsearch → Hydrate from PostgreSQL
Search query hits Elasticsearch for full-text matching. ES returns message IDs. API hydrates full message objects from PostgreSQL. Results filtered by channel membership (access control).
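
The hydrate-and-filter step can be sketched as below, with a dict standing in for PostgreSQL and a list of IDs standing in for Elasticsearch hits. The `hydrate_results` helper and the sample data are hypothetical.

```python
from typing import Dict, List, Set


def hydrate_results(hit_ids: List[str], message_store: Dict[str, dict],
                    user_channels: Set[str]) -> List[dict]:
    """Turn ranked message IDs into full messages the user is allowed to see."""
    results = []
    for msg_id in hit_ids:                 # keep the search engine's ranking order
        msg = message_store.get(msg_id)
        if msg is None:
            continue                       # indexed but since deleted (index can lag)
        if msg["channel_id"] not in user_channels:
            continue                       # access control: not a member of the channel
        results.append(msg)
    return results


message_store = {
    "1630000000.000001": {"id": "1630000000.000001", "channel_id": "C1", "text": "deploy at 5"},
    "1630000001.000001": {"id": "1630000001.000001", "channel_id": "C9", "text": "deploy keys"},
}
hits = ["1630000001.000001", "1630000000.000001", "1630000002.000001"]
visible = hydrate_results(hits, message_store, user_channels={"C1", "C2"})
print([m["id"] for m in visible])
```

Checking membership against the source of truth at read time keeps results correct even if the index still holds messages from channels the user has left.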
| Feature | Technology | Latency |
| --- | --- | --- |
| Full-text message search | Elasticsearch | <100ms |
| File & attachment search | Elasticsearch + S3 metadata | <200ms |
| Message history (recent) | PostgreSQL scan (indexed) | <50ms |
| Real-time index updates | Kafka → ES consumer | <2s lag |

Why These Technologies?

| Component | Choice | Why Not the Alternative? |
| --- | --- | --- |
| Message storage | PostgreSQL | Strong consistency for message ordering; Cassandra would require manual ordering logic |
| Real-time delivery | WebSocket per gateway | SSE (Server-Sent Events) is unidirectional; WebSocket enables bidirectional protocol messages |
| Fan-out queue | Kafka (channel_id partition key) | Redis pub/sub loses messages if subscriber is down; Kafka persists and replays |
| Search | Elasticsearch per workspace | Solr: worse scalability and REST API; PostgreSQL FTS: no distributed sharding |
| File storage | S3 + pre-signed URLs | Storing files in DB: wastes IOPS and DB storage on binary blobs |
| Presence | Redis TTL keys | DB rows need a cleanup job; Redis TTL auto-expires when heartbeat stops |

The Presence TTL Pattern

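Slack clients send WebSocket heartbeats every 30 seconds. The server calls SETEX presence:{user_id} 65 {gateway_server} on each heartbeat. The 65-second TTL is just over two heartbeat intervals, so a single missed heartbeat does not mark the user offline. If a client crashes without sending a disconnect, the key expires within 65 seconds of the last heartbeat automatically, with no background cleanup job required.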
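
The pattern can be simulated without Redis. In this sketch a small in-memory `TtlStore` (hypothetical) mimics the SETEX and GET semantics, and the clock is injected so expiry can be stepped through deterministically; the 30-second interval and 65-second TTL come from the text.

```python
from typing import Callable, Dict, Optional, Tuple

HEARTBEAT_INTERVAL_S = 30
PRESENCE_TTL_S = 65


class TtlStore:
    """In-memory stand-in for Redis SETEX/GET with lazy expiry."""

    def __init__(self, clock: Callable[[], float]):
        self._clock = clock
        self._data: Dict[str, Tuple[str, float]] = {}  # key -> (value, expires_at)

    def setex(self, key: str, ttl_s: float, value: str) -> None:
        self._data[key] = (value, self._clock() + ttl_s)

    def get(self, key: str) -> Optional[str]:
        entry = self._data.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if self._clock() >= expires_at:
            del self._data[key]      # expired: behave as if the key never existed
            return None
        return value


def on_heartbeat(store: TtlStore, user_id: str, gateway: str) -> None:
    # Each heartbeat refreshes the TTL; no explicit "offline" write is ever needed.
    store.setex(f"presence:{user_id}", PRESENCE_TTL_S, gateway)


def is_online(store: TtlStore, user_id: str) -> bool:
    return store.get(f"presence:{user_id}") is not None


now = [0.0]                          # mutable fake clock
store = TtlStore(clock=lambda: now[0])

on_heartbeat(store, "alice", "gw-1")             # t = 0
now[0] = 30.0
on_heartbeat(store, "alice", "gw-1")             # t = 30, key now expires at t = 95
now[0] = 90.0
online_after_one_miss = is_online(store, "alice")   # heartbeat at t = 60 was missed
now[0] = 95.0
online_after_crash = is_online(store, "alice")
print(online_after_one_miss, online_after_crash)
```

Because the stored value is the gateway server name, the same key answers both "is this user online?" and "where should their messages be routed?".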

Quiz โ€” 5 Questions

1. Slack uses WebSocket for message delivery because:
2. A Slack workspace has a #general channel with 1,000 members. When someone posts, messages must be delivered to:
3. Slack message timestamps look like 1630000000.000001. The decimal part (sequence) ensures:
4. Slack search is scoped to channels you're a member of. This is NOT just a privacy feature; it also:
5. When a Slack user goes offline, the server detects this via: