Day 23 · Week 4 — Real-World Systems

Design Twitter / X

Fan-out strategies, tweet storage, timeline generation, search indexing, and handling celebrity accounts with 100M followers at 500M tweets per day.

5 Strategies
~4h Study
2 Simulations
5 Quizzes

The Core Challenge: Fan-Out

📢
Fan-out on Write
Pre-compute timelines at tweet time. Push tweet_id into each follower's Redis list immediately. Fast reads (~1ms), but write amplification: 1 tweet × N followers = N cache writes.
📖
Fan-out on Read
Pull tweets from all followed accounts at read time. Fast writes (just persist the tweet), but slow reads — must query N followed accounts' timelines and merge. Doesn't scale for users who follow many accounts.
🔀
Hybrid Approach
Fan-out on write for normal users. Fan-out on read for celebrities (>1M followers). Timeline read merges pre-computed cache + live celebrity pulls. Balances write amplification vs read complexity.
🗄️
Storage Strategy
Tweets → Cassandra (write-heavy, time-series, horizontally scalable). Timeline cache → Redis sorted sets. User profiles + follow graph → PostgreSQL. Search index → Elasticsearch.
⚠️
The Celebrity Problem: When @elonmusk tweets to 100M+ followers using fan-out on write: 100M Redis LPUSH operations, consuming ~800MB of Redis memory per tweet ($8/tweet at scale). The fan-out queue would take 5+ minutes to drain. This is why Twitter uses the hybrid approach.

Fan-Out Write Amplification

ℹ️
How to use: Drag the slider to change follower count. Watch how write operations, read latency, and memory usage change. Notice when the strategy switches from push to hybrid.
Tweet Fan-Out Visualization

Each dot represents a follower receiving the tweet. Green = received, gray = waiting.
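The amplification the visualization demonstrates is simple arithmetic. A back-of-envelope sketch — the 8 bytes per ID follows from the 64-bit Snowflake format, while the ~300k LPUSH/s sustained throughput is an assumed figure for illustration, not a measured one:

```python
# Back-of-envelope fan-out-on-write cost for a single tweet.
# bytes_per_id = 8 (64-bit Snowflake ID); lpush_per_sec is an
# assumed sustained Redis write rate, chosen for illustration.
def fanout_cost(followers: int, bytes_per_id: int = 8,
                lpush_per_sec: int = 300_000) -> tuple:
    writes = followers                             # one LPUSH per follower
    memory_mb = followers * bytes_per_id / 1e6     # timeline-cache growth
    drain_min = followers / lpush_per_sec / 60     # queue drain time
    return writes, memory_mb, drain_min

print(fanout_cost(10_000))        # normal user: trivial cost
print(fanout_cost(100_000_000))   # celebrity: 100M writes, 800 MB, ~5.6 min
```

Under these assumptions a 100M-follower account lands almost exactly on the "800MB and 5+ minutes per tweet" figures quoted above — which is the point at which the strategy must switch to hybrid.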

Tweet Storage & Data Model

ℹ️
Why Cassandra for tweets: Twitter stores ~500M tweets/day. Cassandra is ideal: write-heavy workload, time-series access patterns (newest tweets first), horizontal scaling without resharding, and no single point of failure. PostgreSQL handles user profiles and the follow graph (relational data).

Tweet Schema

Field       | Type               | Notes
------------|--------------------|----------------------------------------------------------
tweet_id    | bigint (Snowflake) | Time-ordered, globally unique without coordination
user_id     | bigint             | Creator ID (partition key in Cassandra)
content     | text               | ≤280 characters (enforced at the API layer)
created_at  | timestamp          | Redundant — encoded in tweet_id, but explicit for queries
reply_to    | bigint, nullable   | tweet_id of the parent tweet, for threading
media_keys  | text[]             | S3 references for images/video (stored separately)
like_count  | counter            | Cassandra counter column — atomic increments, eventual consistency
PYTHON · snowflake_id.py
# Twitter Snowflake ID generation (single-process sketch)
import time

EPOCH = 1288834974657  # Twitter epoch (Nov 4, 2010) in ms
MACHINE_ID = 1         # 0-1023 — unique per machine/datacenter
_sequence = 0
_last_ms = -1

def generate_tweet_id() -> int:
    global _sequence, _last_ms
    ms = int(time.time() * 1000) - EPOCH
    if ms == _last_ms:
        _sequence = (_sequence + 1) % 4096  # 12-bit sequence: 0-4095
        if _sequence == 0:                  # 4096 IDs this ms — wait for next ms
            while ms <= _last_ms:
                ms = int(time.time() * 1000) - EPOCH
    else:
        _sequence = 0                       # new millisecond: reset sequence
    _last_ms = ms
    return (ms << 22) | (MACHINE_ID << 12) | _sequence
    # 64-bit layout: [timestamp 41b][machine 10b][sequence 12b]
    # Result: larger ID = later tweet (time-sortable!)

# Why Snowflake beats UUID:
# - Time-ordered: sort(tweet_ids) ≈ sort(by time) — no secondary index
# - Globally unique without coordination (each machine has own MACHINE_ID)
# - 64-bit int fits in 8 bytes vs UUID's 16 bytes — 50% storage savings
# - Monotonically increasing: B-tree index pages fill sequentially (no fragmentation)
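Because the high 41 bits are the timestamp, `created_at` can be recovered from the ID alone — a quick sketch of the reverse operation:

```python
EPOCH = 1288834974657  # Twitter epoch in ms (matches the generator above)

def snowflake_to_ms(tweet_id: int) -> int:
    # shift off the 10 machine bits + 12 sequence bits,
    # then add the epoch back to get absolute Unix ms
    return (tweet_id >> 22) + EPOCH

# round-trip check with a hand-built ID:
# 1 hour after the epoch, machine 1, sequence 0
tid = (3_600_000 << 22) | (1 << 12) | 0
assert snowflake_to_ms(tid) == EPOCH + 3_600_000
```

This is why the schema can afford to treat `created_at` as redundant: any consumer holding a tweet_id already holds its creation time.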

The Celebrity Problem & Hybrid Solution

When @elonmusk tweets to 100M+ followers, pure fan-out on write is impossible. The hybrid approach uses a different strategy depending on the tweeter's follower count.

⚠️
Pure fan-out on write cost for celebrities: 100M followers × 8 bytes per tweet_id = 800MB Redis memory per tweet. At $0.01/MB, that's $8 per celebrity tweet in Redis memory alone. Plus 100M Redis LPUSH operations taking 5+ minutes. Non-starter.

Hybrid Architecture Flow

1. User tweets → check follower count.
2. If < 1M followers: fan-out on write — push tweet_id to every follower's timeline cache.
3. If ≥ 1M followers: skip fan-out — write only to the celebrity's own list and mark the tweet as "celebrity_tweet".
4. User reads timeline → fetch the pre-computed cache (normal follows), pull the last 50 tweets from each followed celebrity, and merge by Snowflake ID (timestamp order).
PYTHON · timeline_fanout.py
import redis
from celery import Celery

CELEBRITY_THRESHOLD = 1_000_000  # followers
r = redis.Redis(decode_responses=True)
app = Celery('fanout', broker='redis://localhost:6379/1')
# `db` = the data-access layer (follower counts, follow graph), defined elsewhere

def post_tweet(user_id: str, tweet_id: int):
    follower_count = db.get_follower_count(user_id)
    if follower_count < CELEBRITY_THRESHOLD:
        # Fan-out on write: push tweet_id to each follower's cache
        fanout_tweet.apply_async((user_id, tweet_id), queue='fanout')
    else:
        # Celebrity: write to own list only — followers pull on read
        r.lpush(f'timeline:celeb:{user_id}', tweet_id)
        r.ltrim(f'timeline:celeb:{user_id}', 0, 799)

@app.task(rate_limit='10000/s')
def fanout_tweet(user_id: str, tweet_id: int):
    # Paginated: fetch follower_ids in batches of 1000
    for batch in db.get_follower_ids_paginated(user_id, batch_size=1000):
        pipe = r.pipeline()
        for fid in batch:
            pipe.lpush(f'timeline:{fid}', tweet_id)
            pipe.ltrim(f'timeline:{fid}', 0, 799)  # keep last 800
        pipe.execute()

def get_home_timeline(user_id: str, limit: int = 20) -> list:
    # 1. Pre-computed timeline for normal follows
    normal_ids = r.lrange(f'timeline:{user_id}', 0, limit * 2 - 1)
    # 2. Live-pull from each followed celebrity
    celeb_ids = []
    for celeb in db.get_celeb_following(user_id):
        celeb_ids += r.lrange(f'timeline:celeb:{celeb}', 0, 49)
    # 3. Merge, dedupe, and sort numerically by Snowflake ID (newest first)
    merged = {int(tid) for tid in normal_ids + celeb_ids}
    return sorted(merged, reverse=True)[:limit]
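The final merge above concatenates and re-sorts everything. Since each Redis list is already newest-first, a k-way merge can produce the top `limit` IDs without a full sort — a standard-library sketch of that refinement:

```python
import heapq

def merge_timelines(*timelines: list, limit: int = 20) -> list:
    # each input list is already in descending Snowflake-ID order, so
    # heapq.merge(reverse=True) yields one globally descending stream
    out, seen = [], set()
    for tid in heapq.merge(*timelines, reverse=True):
        if tid not in seen:       # dedupe: a tweet may sit in both caches
            seen.add(tid)
            out.append(tid)
            if len(out) == limit:
                break             # stop early — no need to sort the rest
    return out

print(merge_timelines([9, 7, 3], [8, 7, 2], limit=4))  # [9, 8, 7, 3]
```

With a handful of 50-item celebrity lists plus one cached list, the difference is small; the early exit matters more as the number of followed celebrities grows.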

Technology Decisions

Tweet Storage
Cassandra
Not PostgreSQL. Cassandra is built for write-heavy time-series data with horizontal scaling. Partitioned by user_id, clustered by tweet_id (time-ordered). No resharding needed as data grows.
Timeline Cache
Redis
Not Memcached. Redis offers capped lists (LPUSH/LTRIM, as used in the fan-out code above) and sorted sets (ZADD + ZRANGEBYSCORE) for time-ordered retrieval and timeline merging. Memcached is plain key-value — no list or sorted-set operations.
Search
Elasticsearch
Not PostgreSQL full-text search. Elasticsearch provides real-time inverted index updates (tweets indexed within seconds), relevance ranking, and faceted filtering by time/user/engagement.
Fan-out Queue
Kafka
Not SQS. Kafka allows multiple consumers (fan-out service, search indexer, notification service, analytics) to all process the same tweet_created event independently via consumer groups.
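The key property — each consumer group independently sees every event — can be illustrated with a toy in-memory log (a simplified model of the concept, not the Kafka client API):

```python
from collections import defaultdict

class TopicLog:
    """Toy append-only log: one committed offset per consumer group."""
    def __init__(self):
        self.events = []
        self.offsets = defaultdict(int)

    def publish(self, event: dict):
        self.events.append(event)

    def poll(self, group: str) -> list:
        # each group advances its own offset, so groups never consume
        # each other's events — unlike workers sharing a single SQS queue
        start = self.offsets[group]
        self.offsets[group] = len(self.events)
        return self.events[start:]

log = TopicLog()
log.publish({'type': 'tweet_created', 'tweet_id': 42})
assert log.poll('fanout-service') == [{'type': 'tweet_created', 'tweet_id': 42}]
assert log.poll('search-indexer') == [{'type': 'tweet_created', 'tweet_id': 42}]
assert log.poll('fanout-service') == []  # already consumed by this group
```

With SQS, a delivered-and-deleted message is gone for everyone; the log-plus-offsets model is what lets the fan-out service, search indexer, notification service, and analytics all read the same `tweet_created` stream.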

Quiz

Question 1 of 5
Twitter's timeline generation uses "fan-out on write" for most users. When a user tweets, what happens immediately?
ANothing — the timeline is generated fresh on every read request
BThe tweet_id is pushed to each follower's timeline cache (Redis LPUSH) immediately via a fan-out worker
CThe tweet is broadcast via WebSocket to all online followers simultaneously
DThe database stores the tweet and followers query it on their next hourly refresh
Question 2 of 5
A celebrity with 100M followers tweets. The pure "fan-out on write" approach would:
AWork fine — Redis handles 100M writes per tweet efficiently in under a second
BGenerate 100M Redis writes per tweet, taking 5+ minutes and consuming 800MB+ of Redis memory
CBe blocked automatically by Twitter's internal rate limiter before causing any damage
DRequire 100M separate network packets to be sent to all follower servers simultaneously
Question 3 of 5
Twitter uses Snowflake IDs (64-bit integers). They are time-ordered because:
ATwitter sorts all IDs alphabetically in a background process each minute
BThe 41 high-order bits encode millisecond timestamp — a numerically larger ID always means a later tweet
CSnowflake is a database product that automatically sorts all inserts
DIDs are assigned by a central global sequence generator that issues them in order
Question 4 of 5
Cassandra is chosen for tweet storage over PostgreSQL because:
ACassandra supports SQL JOIN operations needed for threading queries
BCassandra offers stronger consistency guarantees than PostgreSQL
CCassandra is optimized for write-heavy, time-series data with linear horizontal scaling — add nodes for more write throughput without resharding
DCassandra natively stores JSON which makes tweet content easier to handle
Question 5 of 5
Twitter search (finding tweets containing "ChatGPT") is implemented using:
AA full table scan of all 500M daily tweets in Cassandra
BPostgreSQL full-text search indexes on the tweet content column
CAn inverted index in Elasticsearch — maps each word to all tweets containing it, updated in near-real-time as tweets are posted
DPre-computed search result pages that are regenerated every 5 minutes