Day 28 — Design Slack | System Design

Key Concepts

Core Design Principles

Four architectural pillars make Slack's real-time messaging work at the scale of 12M daily active users (DAU).

📢

Channel Architecture

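Messages are stored per channel, and each new message fans out to every online member when it is sent. Channel membership is cached in Redis as a set, so checking whether a user belongs to a channel is an O(1) operation; reading the full member list for fan-out is O(N) in the number of members.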

🔌

WebSocket Connections

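Each Slack client holds a persistent WebSocket. In the worst case, where every daily active user is online at once, that is about 12M concurrent connections distributed across gateway servers. Each gateway tracks its own active connections in local memory and registers them in Redis so that other services can route messages to the right gateway.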
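
The two-level lookup described above (local memory per gateway, shared routing table in Redis) can be sketched as follows. This is a minimal illustration, not Slack's implementation: a plain dict stands in for Redis, and `Gateway`, `ROUTING_TABLE`, and `route` are hypothetical names.

```python
from typing import Callable, Dict, Optional

# Stand-in for the shared Redis routing table: user_id -> gateway name.
ROUTING_TABLE: Dict[str, str] = {}


class Gateway:
    """One gateway server: owns sockets locally, advertises them globally."""

    def __init__(self, name: str) -> None:
        self.name = name
        # Local memory: user_id -> callable that writes to that user's socket.
        self.connections: Dict[str, Callable[[dict], None]] = {}

    def connect(self, user_id: str, send: Callable[[dict], None]) -> None:
        self.connections[user_id] = send
        ROUTING_TABLE[user_id] = self.name

    def disconnect(self, user_id: str) -> None:
        self.connections.pop(user_id, None)
        # Only clear the shared entry if it still points at this gateway.
        if ROUTING_TABLE.get(user_id) == self.name:
            del ROUTING_TABLE[user_id]

    def deliver(self, user_id: str, event: dict) -> bool:
        send = self.connections.get(user_id)
        if send is None:
            return False
        send(event)
        return True


def route(gateways: Dict[str, Gateway], user_id: str, event: dict) -> bool:
    """Look up the user's gateway in the shared table, then deliver locally."""
    name: Optional[str] = ROUTING_TABLE.get(user_id)
    if name is None:
        return False  # no routing entry: treat the user as offline
    return gateways[name].deliver(user_id, event)
```

The guard in `disconnect` matters when a client reconnects to a different gateway before the old one notices the drop: the old gateway must not erase the newer routing entry.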

🧵

Message Threading

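A reply carries a thread_ts field holding the ID of its parent message, and that link is what creates the threaded view. Threads are flat: every reply points directly at the parent, so PostgreSQL can return a whole thread with a single indexed query on thread_ts and no recursive lookups.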
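
To illustrate why no recursion is needed, here is a small sketch that uses an in-memory SQLite database as a stand-in for PostgreSQL. The table layout, index name, and `fetch_thread` helper are assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE messages (
        id         TEXT PRIMARY KEY,  -- timestamp.sequence format
        channel_id TEXT NOT NULL,
        text       TEXT NOT NULL,
        thread_ts  TEXT               -- NULL for top-level messages
    )
""")
conn.execute("CREATE INDEX idx_thread ON messages (channel_id, thread_ts, id)")

rows = [
    ("1630000000.000001", "C1", "Deploy at 3pm?", None),
    ("1630000005.000001", "C1", "Works for me", "1630000000.000001"),
    ("1630000007.000001", "C1", "Unrelated message", None),
    ("1630000009.000001", "C1", "Same here", "1630000000.000001"),
]
conn.executemany("INSERT INTO messages VALUES (?, ?, ?, ?)", rows)


def fetch_thread(channel_id: str, parent_id: str) -> list:
    """Return the parent followed by its replies, oldest first.

    IDs are fixed-width strings here, so text ordering matches time ordering.
    """
    cur = conn.execute(
        """
        SELECT id, text FROM messages
        WHERE channel_id = ? AND (id = ? OR thread_ts = ?)
        ORDER BY id
        """,
        (channel_id, parent_id, parent_id),
    )
    return cur.fetchall()


for msg_id, text in fetch_thread("C1", "1630000000.000001"):
    print(msg_id, text)
```

Because replies never nest, one flat query returns the whole thread; a true reply tree would need a recursive query instead.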

🔍

Search

Elasticsearch indexes all messages per workspace. Search is scoped to a single workspace, never global. Each workspace has its own Elasticsearch index, which keeps full-text search tractable at scale.

💡

Why WebSocket over HTTP Polling?

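With short polling, a message is delivered only on the client's next request, so a 1-5 second interval adds up to 1-5 seconds of latency. Long-polling removes most of that delay but ties up server resources holding one request open per client and re-establishing it after every response. A WebSocket lets the server push to the client the moment a message arrives, with no polling overhead. At 12M users, even a 1-second polling interval means 12M HTTP requests per second, most of which return nothing.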
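
The arithmetic behind that claim can be worked out directly. The comparison below assumes all 12M users are connected at once and uses the 30-second WebSocket heartbeat interval from the presence section as the idle cost of a persistent connection.

```python
USERS = 12_000_000          # connected clients (figure from the text)
POLL_INTERVAL_S = 1         # short polling: one HTTP request per second
HEARTBEAT_INTERVAL_S = 30   # WebSocket: one small heartbeat frame per 30 s


def requests_per_second(clients: int, interval_s: float) -> float:
    """Steady-state request rate when every client sends once per interval."""
    return clients / interval_s


polling_rps = requests_per_second(USERS, POLL_INTERVAL_S)
heartbeat_rps = requests_per_second(USERS, HEARTBEAT_INTERVAL_S)

print(f"polling:    {polling_rps:,.0f} requests/s")   # polling:    12,000,000 requests/s
print(f"heartbeats: {heartbeat_rps:,.0f} frames/s")   # heartbeats: 400,000 frames/s
print(f"ratio:      {polling_rps / heartbeat_rps:.0f}x")  # ratio:      30x
```

Each polling request is also a full HTTP request with headers, while a heartbeat is a small frame on an existing connection, so the gap in actual server work is larger than the 30x ratio suggests.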

Architecture

Message Storage & Delivery Pipeline

1. Sender → WebSocket → Gateway → Message Service

Client sends message over its persistent WebSocket to the assigned gateway server. Optimistic UI shows the message immediately. Gateway forwards to Message Service.

2. Persist to PostgreSQL (source of truth)

The Message Service assigns a timestamp-based ID in the 1630000000.000001 format and writes the message to a PostgreSQL table partitioned by channel_id. Writing before delivery makes the message durable, and the ID fixes its position in the channel's order.

3. Publish to Kafka topic (channel_id as partition key)

After the write succeeds, the message is published to a Kafka topic, and the fan-out service consumes it. Using channel_id as the partition key puts all of a channel's messages in one partition, so a single consumer processes them in order. Kafka also provides durability and replay capability.

4. Fan-out: online → WebSocket push, offline → push notification

The fan-out service reads channel membership from Redis. For each online member, it looks up the member's gateway server in Redis and routes the message there. For offline members, it queues APNs/FCM push notifications.
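
The ordering guarantee in step 3 comes from key-based partitioning: the same key always hashes to the same partition. The sketch below shows the idea with in-memory lists. CRC32 is a stand-in for the murmur2 hash used by Kafka's default partitioner, and the partition count and `publish` helper are assumptions.

```python
import zlib
from collections import defaultdict
from typing import DefaultDict, List, Tuple

NUM_PARTITIONS = 8  # assumed partition count for a hypothetical topic


def partition_for(channel_id: str, num_partitions: int = NUM_PARTITIONS) -> int:
    """Map a key to a partition (CRC32 here; Kafka's default uses murmur2)."""
    return zlib.crc32(channel_id.encode("utf-8")) % num_partitions


def publish(log: DefaultDict[int, List[Tuple[str, str]]],
            channel_id: str, message_id: str) -> None:
    """Append to the partition chosen by the channel key."""
    log[partition_for(channel_id)].append((channel_id, message_id))


log: DefaultDict[int, List[Tuple[str, str]]] = defaultdict(list)
for seq in range(1, 4):
    publish(log, "C1", f"1630000000.00000{seq}")
    publish(log, "C2", f"1630000000.00000{seq}")

# Every C1 message sits in one partition, in the order it was published.
c1_partition = log[partition_for("C1")]
print([m for c, m in c1_partition if c == "C1"])
```

Ordering holds only within a partition. Two different channels may share a partition or not, which is fine because no ordering is required between channels.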

Message Schema

Field	Type	Notes
`id`	varchar	1630000000.000001 — timestamp.sequence format
`channel_id`	varchar	Which channel this message belongs to
`user_id`	bigint	Who sent the message
`text`	text	Message content (may be empty if file-only)
`thread_ts`	varchar	Points to parent message id if this is a reply
`files`	jsonb[]	Array of S3 file references with metadata
`reactions`	jsonb	Map of emoji → [user_ids] for reaction counts
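
One way to produce IDs in the timestamp.sequence format is sketched below. It assumes a single process issues IDs for a given channel; `MessageIdGenerator` is a hypothetical helper, not Slack's actual scheme.

```python
import time
from collections import defaultdict
from typing import Callable, Dict, Tuple


class MessageIdGenerator:
    """Issues 'seconds.sequence' IDs, unique and increasing per channel."""

    def __init__(self, clock: Callable[[], float] = time.time) -> None:
        self._clock = clock
        # channel_id -> (last second issued, last sequence in that second)
        self._last: Dict[str, Tuple[int, int]] = defaultdict(lambda: (0, 0))

    def next_id(self, channel_id: str) -> str:
        second = int(self._clock())
        last_second, last_seq = self._last[channel_id]
        if second <= last_second:
            # Same second, or the clock moved backwards: bump the sequence.
            second, seq = last_second, last_seq + 1
        else:
            seq = 1
        self._last[channel_id] = (second, seq)
        # Six digits keep IDs fixed-width while the sequence stays below 10**6.
        return f"{second}.{seq:06d}"


gen = MessageIdGenerator(clock=lambda: 1630000000.4)  # frozen clock for the demo
print(gen.next_id("C1"))  # 1630000000.000001
print(gen.next_id("C1"))  # 1630000000.000002
print(gen.next_id("C2"))  # 1630000000.000001
```

The sequence part is what keeps two messages sent to the same channel within the same second distinct and ordered.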

channel_fanout.py

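# Slack-style channel fan-out (simplified: runs inline instead of via Kafka).
# Assumed to exist elsewhere: `db` (an asyncpg-style connection), `redis`
# (an async client created with decode_responses=True so members are str),
# publish_to_server(), and queue_push_notification().
import time


async def send_channel_message(channel_id: str, sender_id: int, text: str):
    # time.time() alone is not guaranteed unique; a real service would use a
    # per-channel sequence to build the timestamp.sequence ID.
    msg_ts = f"{time.time():.6f}"

    # 1. Persist to database (source of truth); `id` matches the schema above
    await db.execute("""
        INSERT INTO messages (id, channel_id, user_id, text)
        VALUES ($1, $2, $3, $4)
    """, msg_ts, channel_id, sender_id, text)

    # 2. Get channel membership (cached in Redis as a set)
    member_ids = await redis.smembers(f"channel:{channel_id}:members")

    # 3. Fan-out to online members via WebSocket
    online_members = await redis.smembers(f"channel:{channel_id}:online")

    event = {"type": "message", "channel": channel_id,
             "ts": msg_ts, "text": text}

    for member_id in online_members:
        # Find which gateway holds this member's WebSocket
        conn_server = await redis.get(f"user:{member_id}:conn_server")
        if conn_server:
            await publish_to_server(conn_server, member_id, event)

    # 4. Push notifications for offline members (set difference)
    offline = member_ids - online_members
    for member_id in offline:
        await queue_push_notification(member_id, channel_id, text)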

Search

Search Architecture

🔍

Slack Search Scope

Slack search covers only messages in channels you can access within one workspace; it is never global across workspaces. That scoping is what makes search feasible: each workspace lives in its own Elasticsearch index, so a query touches one workspace's data. Cross-workspace search would require querying a single global index holding the messages of millions of workspaces.

1. Messages → Kafka topic

Every new message is published to Kafka alongside the fan-out path. A separate consumer group handles search indexing asynchronously — no impact on message delivery latency.

2. Kafka → Elasticsearch indexer (per workspace index)

Each workspace has its own ES index. The indexer reads from Kafka and writes to the correct workspace index. Field mappings: text (analyzed), user_id, channel_id, ts (range queries), attachments.

3. Search API → Elasticsearch → Hydrate from PostgreSQL

The search query hits Elasticsearch for full-text matching, and Elasticsearch returns the matching message IDs. The API then hydrates full message objects from PostgreSQL. Results are filtered by channel membership so that users only see messages they are allowed to access.
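
The search path above can be sketched as a request builder plus a hydration step. The `messages-{workspace_id}` index name and both helper functions are assumptions. The sketch also applies the membership filter inside the Elasticsearch query rather than after it, which is one possible design choice and not something the pipeline above prescribes.

```python
from typing import Any, Dict, List, Tuple


def build_search_request(workspace_id: str, query: str,
                         accessible_channel_ids: List[str],
                         size: int = 20) -> Tuple[str, Dict[str, Any]]:
    """Return (index name, request body) for a workspace-scoped search."""
    index = f"messages-{workspace_id}"  # hypothetical per-workspace index name
    body = {
        "size": size,
        "_source": False,  # only IDs are needed; full rows come from PostgreSQL
        "query": {
            "bool": {
                # Full-text match on the analyzed text field
                "must": [{"match": {"text": query}}],
                # Access control: restrict hits to channels the user can see
                "filter": [{"terms": {"channel_id": accessible_channel_ids}}],
            }
        },
    }
    return index, body


def hydrate(ids: List[str], rows_by_id: Dict[str, dict]) -> List[dict]:
    """Fetch full messages for the hits, preserving the ranking order.

    rows_by_id is a stand-in for a PostgreSQL lookup by primary key.
    """
    return [rows_by_id[i] for i in ids if i in rows_by_id]


index, body = build_search_request("T123", "deploy", ["C1", "C2"])
print(index)  # messages-T123
```

Filtering inside the query keeps result pages full; filtering after the fact can leave a page with fewer hits than requested once inaccessible messages are dropped.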

Feature	Technology	Target latency
Full-text message search	Elasticsearch	<100ms
File & attachment search	Elasticsearch + S3 metadata	<200ms
Message history (recent)	PostgreSQL scan (indexed)	<50ms
Real-time index updates	Kafka → ES consumer	<2s lag
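
Search indexing can lag without slowing delivery because each Kafka consumer group tracks its own read offset over the same log. The sketch below models a single partition in memory; `TopicLog` and the group names are hypothetical.

```python
from typing import Dict, List


class TopicLog:
    """Append-only log with one read offset per consumer group."""

    def __init__(self) -> None:
        self.records: List[dict] = []
        self.offsets: Dict[str, int] = {}

    def append(self, record: dict) -> None:
        self.records.append(record)

    def poll(self, group: str, max_records: int = 100) -> List[dict]:
        """Return the next batch for this group and advance only its offset."""
        start = self.offsets.get(group, 0)
        batch = self.records[start:start + max_records]
        self.offsets[group] = start + len(batch)
        return batch

    def lag(self, group: str) -> int:
        """Number of records this group has not read yet."""
        return len(self.records) - self.offsets.get(group, 0)


topic = TopicLog()
for i in range(3):
    topic.append({"id": f"1630000000.00000{i + 1}", "text": f"message {i + 1}"})

topic.poll("fanout")                        # delivery path reads everything
topic.poll("search-indexer", max_records=2)  # indexer is one record behind

print(topic.lag("fanout"))          # 0
print(topic.lag("search-indexer"))  # 1
```

A slow indexer only increases its own lag; the fan-out group's position in the log is unaffected.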

Technology Decisions

Why These Technologies?

Component	Choice	Why Not the Alternative?
Message storage	PostgreSQL	Strong consistency for message ordering; Cassandra would require manual ordering logic
Real-time delivery	WebSocket per gateway	SSE (Server-Sent Events) is unidirectional; WebSocket enables bidirectional protocol messages
Fan-out queue	Kafka (channel_id partition key)	Redis pub/sub loses messages if subscriber is down; Kafka persists and replays
Search	Elasticsearch per workspace	Solr: clustering and the REST API are generally less convenient to operate; PostgreSQL FTS: no built-in distributed sharding
File storage	S3 + pre-signed URLs	Storing files in DB: wastes IOPS and DB storage on binary blobs
Presence	Redis TTL keys	DB rows need a cleanup job; Redis TTL auto-expires when heartbeat stops

💡

The Presence TTL Pattern

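Slack clients send WebSocket heartbeats every 30 seconds. On each heartbeat the server calls SETEX presence:{user_id} 65 {gateway_server}, which resets a 65-second TTL. The TTL covers two heartbeat intervals plus a 5-second margin, so one delayed or dropped heartbeat does not mark the user offline. If a client crashes without sending a disconnect, the key expires automatically within 65 seconds of the last heartbeat, with no background cleanup job required.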
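
The expiry behaviour can be simulated without a Redis server. The sketch below is an in-memory stand-in for SETEX and GET with an injectable clock; `PresenceStore`, `on_heartbeat`, and `is_online` are hypothetical names.

```python
import time
from typing import Callable, Dict, Optional, Tuple

HEARTBEAT_TTL_S = 65  # TTL from the text: two 30 s heartbeats plus margin


class PresenceStore:
    """In-memory stand-in for Redis SETEX/GET with lazy expiry."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        self._data: Dict[str, Tuple[str, float]] = {}  # key -> (value, expiry)

    def setex(self, key: str, ttl_s: float, value: str) -> None:
        self._data[key] = (value, self._clock() + ttl_s)

    def get(self, key: str) -> Optional[str]:
        entry = self._data.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if self._clock() >= expires_at:
            del self._data[key]  # expired: behave as if the key never existed
            return None
        return value


def on_heartbeat(store: PresenceStore, user_id: str, gateway: str) -> None:
    """Refresh the presence key; each call resets the TTL."""
    store.setex(f"presence:{user_id}", HEARTBEAT_TTL_S, gateway)


def is_online(store: PresenceStore, user_id: str) -> bool:
    return store.get(f"presence:{user_id}") is not None


now = [0.0]  # fake clock so the example runs instantly
store = PresenceStore(clock=lambda: now[0])
on_heartbeat(store, "u1", "gw-a")
now[0] = 64.0
print(is_online(store, "u1"))  # True
now[0] = 65.0
print(is_online(store, "u1"))  # False
```

Because the key's value is the gateway name, the same key can answer both "is this user online?" and "which gateway holds their connection?".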

Knowledge Check

Quiz — 5 Questions

1. Slack uses WebSocket for message delivery because:

2. A Slack workspace has a #general channel with 1,000 members. When someone posts, messages must be delivered to:

3. Slack message timestamps look like 1630000000.000001. The decimal part (sequence) ensures:

4. Slack search is scoped to channels you're a member of. This is NOT just a privacy feature — it also:

5. When a Slack user goes offline, the server detects this via:
