Day 23 · Week 4 — Real-World Systems

Design Twitter / X

Fan-out strategies, tweet storage, timeline generation, search indexing, and handling celebrity accounts with 100M followers at 500M tweets per day.

5 Strategies
~4h Study
2 Simulations
5 Quizzes

The Core Challenge: Fan-Out

📢
Fan-out on Write
Pre-compute timelines at tweet time. Push tweet_id into each follower's Redis list immediately. Fast reads (~1ms), but write amplification: 1 tweet × N followers = N cache writes.
📖
Fan-out on Read
Pull tweets from all followed accounts at read time. Fast writes (just persist the tweet), but slow reads — must query N followed accounts' timelines and merge. Doesn't scale for users who follow many accounts.
🔀
Hybrid Approach
Fan-out on write for normal users. Fan-out on read for celebrities (>1M followers). Timeline read merges pre-computed cache + live celebrity pulls. Balances write amplification vs read complexity.
🗄️
Storage Strategy
Tweets → Cassandra (write-heavy, time-series, horizontally scalable). Timeline cache → Redis sorted sets. User profiles + follow graph → PostgreSQL. Search index → Elasticsearch.
⚠️
The Celebrity Problem: When @elonmusk tweets to 100M+ followers using fan-out on write: 100M Redis LPUSH operations, consuming ~800MB of Redis memory per tweet ($8/tweet at scale). The fan-out queue would take 5+ minutes to drain. This is why Twitter uses the hybrid approach.

Fan-Out Write Amplification

ℹ️
How to use: Drag the slider to change follower count. Watch how write operations, read latency, and memory usage change. Notice when the strategy switches from push to hybrid.
Tweet Fan-Out Visualization

Each dot represents a follower receiving the tweet. Green = received, gray = waiting.
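The amplification the visualization demonstrates is simple arithmetic. A back-of-envelope sketch — the 8 bytes per ID follows from the 64-bit Snowflake format, while the ~300k LPUSH/s sustained throughput is an assumed figure for illustration, not a measured one:

```python
# Back-of-envelope fan-out-on-write cost for a single tweet.
# bytes_per_id = 8 (64-bit Snowflake ID); lpush_per_sec is an
# assumed sustained Redis write rate, chosen for illustration.
def fanout_cost(followers: int, bytes_per_id: int = 8,
                lpush_per_sec: int = 300_000) -> tuple:
    writes = followers                             # one LPUSH per follower
    memory_mb = followers * bytes_per_id / 1e6     # timeline-cache growth
    drain_min = followers / lpush_per_sec / 60     # queue drain time
    return writes, memory_mb, drain_min

print(fanout_cost(10_000))        # normal user: trivial cost
print(fanout_cost(100_000_000))   # celebrity: 100M writes, 800 MB, ~5.6 min
```

Under these assumptions a 100M-follower account lands almost exactly on the "800MB and 5+ minutes per tweet" figures quoted above — which is the point at which the strategy must switch to hybrid.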

Tweet Storage & Data Model

ℹ️
Why Cassandra for tweets: Twitter stores ~500M tweets/day. Cassandra is ideal: write-heavy workload, time-series access patterns (newest tweets first), horizontal scaling without resharding, and no single point of failure. PostgreSQL handles user profiles and the follow graph (relational data).

Tweet Schema

Field       | Type               | Notes
------------|--------------------|----------------------------------------------------------
tweet_id    | bigint (Snowflake) | Time-ordered, globally unique without coordination
user_id     | bigint             | Creator ID (partition key in Cassandra)
content     | text               | ≤280 characters (enforced at the API layer)
created_at  | timestamp          | Redundant — encoded in tweet_id, but explicit for queries
reply_to    | bigint, nullable   | tweet_id of the parent tweet, for threading
media_keys  | text[]             | S3 references for images/video (stored separately)
like_count  | counter            | Cassandra counter column — atomic increments, eventual consistency
PYTHON · snowflake_id.py
# Twitter Snowflake ID generation (single-process sketch)
import time

EPOCH = 1288834974657  # Twitter epoch (Nov 4, 2010) in ms
MACHINE_ID = 1         # 0-1023 — unique per machine/datacenter
_sequence = 0
_last_ms = -1

def generate_tweet_id() -> int:
    global _sequence, _last_ms
    ms = int(time.time() * 1000) - EPOCH
    if ms == _last_ms:
        _sequence = (_sequence + 1) % 4096  # 12-bit sequence: 0-4095
        if _sequence == 0:                  # 4096 IDs this ms — wait for next ms
            while ms <= _last_ms:
                ms = int(time.time() * 1000) - EPOCH
    else:
        _sequence = 0                       # new millisecond: reset sequence
    _last_ms = ms
    return (ms << 22) | (MACHINE_ID << 12) | _sequence
    # 64-bit layout: [timestamp 41b][machine 10b][sequence 12b]
    # Result: larger ID = later tweet (time-sortable!)

# Why Snowflake beats UUID:
# - Time-ordered: sort(tweet_ids) ≈ sort(by time) — no secondary index
# - Globally unique without coordination (each machine has own MACHINE_ID)
# - 64-bit int fits in 8 bytes vs UUID's 16 bytes — 50% storage savings
# - Monotonically increasing: B-tree index pages fill sequentially (no fragmentation)
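Because the high 41 bits are the timestamp, `created_at` can be recovered from the ID alone — a quick sketch of the reverse operation:

```python
EPOCH = 1288834974657  # Twitter epoch in ms (matches the generator above)

def snowflake_to_ms(tweet_id: int) -> int:
    # shift off the 10 machine bits + 12 sequence bits,
    # then add the epoch back to get absolute Unix ms
    return (tweet_id >> 22) + EPOCH

# round-trip check with a hand-built ID:
# 1 hour after the epoch, machine 1, sequence 0
tid = (3_600_000 << 22) | (1 << 12) | 0
assert snowflake_to_ms(tid) == EPOCH + 3_600_000
```

This is why the schema can afford to treat `created_at` as redundant: any consumer holding a tweet_id already holds its creation time.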

The Celebrity Problem & Hybrid Solution

When @elonmusk tweets to 100M+ followers, pure fan-out on write is impossible. The hybrid approach uses a different strategy depending on the tweeter's follower count.

⚠️
Pure fan-out on write cost for celebrities: 100M followers × 8 bytes per tweet_id = 800MB Redis memory per tweet. At $0.01/MB, that's $8 per celebrity tweet in Redis memory alone. Plus 100M Redis LPUSH operations taking 5+ minutes. Non-starter.

Hybrid Architecture Flow

1. User tweets → check follower count.
2. If < 1M followers: fan-out on write — push tweet_id to every follower's timeline cache.
3. If ≥ 1M followers: skip fan-out — write only to the celebrity's own list and mark the tweet as "celebrity_tweet".
4. User reads timeline → fetch the pre-computed cache (normal follows), pull the last 50 tweets from each followed celebrity, and merge by Snowflake ID (timestamp order).
PYTHON · timeline_fanout.py
import redis
from celery import Celery

CELEBRITY_THRESHOLD = 1_000_000  # followers
r = redis.Redis(decode_responses=True)
app = Celery('fanout', broker='redis://localhost:6379/1')
# `db` = the data-access layer (follower counts, follow graph), defined elsewhere

def post_tweet(user_id: str, tweet_id: int):
    follower_count = db.get_follower_count(user_id)
    if follower_count < CELEBRITY_THRESHOLD:
        # Fan-out on write: push tweet_id to each follower's cache
        fanout_tweet.apply_async((user_id, tweet_id), queue='fanout')
    else:
        # Celebrity: write to own list only — followers pull on read
        r.lpush(f'timeline:celeb:{user_id}', tweet_id)
        r.ltrim(f'timeline:celeb:{user_id}', 0, 799)

@app.task(rate_limit='10000/s')
def fanout_tweet(user_id: str, tweet_id: int):
    # Paginated: fetch follower_ids in batches of 1000
    for batch in db.get_follower_ids_paginated(user_id, batch_size=1000):
        pipe = r.pipeline()
        for fid in batch:
            pipe.lpush(f'timeline:{fid}', tweet_id)
            pipe.ltrim(f'timeline:{fid}', 0, 799)  # keep last 800
        pipe.execute()

def get_home_timeline(user_id: str, limit: int = 20) -> list:
    # 1. Pre-computed timeline for normal follows
    normal_ids = r.lrange(f'timeline:{user_id}', 0, limit * 2 - 1)
    # 2. Live-pull from each followed celebrity
    celeb_ids = []
    for celeb in db.get_celeb_following(user_id):
        celeb_ids += r.lrange(f'timeline:celeb:{celeb}', 0, 49)
    # 3. Merge, dedupe, and sort numerically by Snowflake ID (newest first)
    merged = {int(tid) for tid in normal_ids + celeb_ids}
    return sorted(merged, reverse=True)[:limit]
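The final merge above concatenates and re-sorts everything. Since each Redis list is already newest-first, a k-way merge can produce the top `limit` IDs without a full sort — a standard-library sketch of that refinement:

```python
import heapq

def merge_timelines(*timelines: list, limit: int = 20) -> list:
    # each input list is already in descending Snowflake-ID order, so
    # heapq.merge(reverse=True) yields one globally descending stream
    out, seen = [], set()
    for tid in heapq.merge(*timelines, reverse=True):
        if tid not in seen:       # dedupe: a tweet may sit in both caches
            seen.add(tid)
            out.append(tid)
            if len(out) == limit:
                break             # stop early — no need to sort the rest
    return out

print(merge_timelines([9, 7, 3], [8, 7, 2], limit=4))  # [9, 8, 7, 3]
```

With a handful of 50-item celebrity lists plus one cached list, the difference is small; the early exit matters more as the number of followed celebrities grows.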

Technology Decisions

Tweet Storage
Cassandra
Not PostgreSQL. Cassandra is built for write-heavy time-series data with horizontal scaling. Partitioned by user_id, clustered by tweet_id (time-ordered). No resharding needed as data grows.
Timeline Cache
Redis
Not Memcached. Redis offers capped lists (LPUSH/LTRIM, as used in the fan-out code above) and sorted sets (ZADD + ZRANGEBYSCORE) for time-ordered retrieval and timeline merging. Memcached is plain key-value — no list or sorted-set operations.
Search
Elasticsearch
Not PostgreSQL full-text search. Elasticsearch provides real-time inverted index updates (tweets indexed within seconds), relevance ranking, and faceted filtering by time/user/engagement.
Fan-out Queue
Kafka
Not SQS. Kafka allows multiple consumers (fan-out service, search indexer, notification service, analytics) to all process the same tweet_created event independently via consumer groups.
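The key property — each consumer group independently sees every event — can be illustrated with a toy in-memory log (a simplified model of the concept, not the Kafka client API):

```python
from collections import defaultdict

class TopicLog:
    """Toy append-only log: one committed offset per consumer group."""
    def __init__(self):
        self.events = []
        self.offsets = defaultdict(int)

    def publish(self, event: dict):
        self.events.append(event)

    def poll(self, group: str) -> list:
        # each group advances its own offset, so groups never consume
        # each other's events — unlike workers sharing a single SQS queue
        start = self.offsets[group]
        self.offsets[group] = len(self.events)
        return self.events[start:]

log = TopicLog()
log.publish({'type': 'tweet_created', 'tweet_id': 42})
assert log.poll('fanout-service') == [{'type': 'tweet_created', 'tweet_id': 42}]
assert log.poll('search-indexer') == [{'type': 'tweet_created', 'tweet_id': 42}]
assert log.poll('fanout-service') == []  # already consumed by this group
```

With SQS, a delivered-and-deleted message is gone for everyone; the log-plus-offsets model is what lets the fan-out service, search indexer, notification service, and analytics all read the same `tweet_created` stream.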

Quiz

Question 1 of 5
Twitter's timeline generation uses "fan-out on write" for most users. When a user tweets, what happens immediately?
ANothing — the timeline is generated fresh on every read request
BThe tweet_id is pushed to each follower's timeline cache (Redis LPUSH) immediately via a fan-out worker
CThe tweet is broadcast via WebSocket to all online followers simultaneously
DThe database stores the tweet and followers query it on their next hourly refresh
Question 2 of 5
A celebrity with 100M followers tweets. The pure "fan-out on write" approach would:
AWork fine — Redis handles 100M writes per tweet efficiently in under a second
BGenerate 100M Redis writes per tweet, taking 5+ minutes and consuming 800MB+ of Redis memory
CBe blocked automatically by Twitter's internal rate limiter before causing any damage
DRequire 100M separate network packets to be sent to all follower servers simultaneously
Question 3 of 5
Twitter uses Snowflake IDs (64-bit integers). They are time-ordered because:
ATwitter sorts all IDs alphabetically in a background process each minute
BThe 41 high-order bits encode millisecond timestamp — a numerically larger ID always means a later tweet
CSnowflake is a database product that automatically sorts all inserts
DIDs are assigned by a central global sequence generator that issues them in order
Question 4 of 5
Cassandra is chosen for tweet storage over PostgreSQL because:
ACassandra supports SQL JOIN operations needed for threading queries
BCassandra offers stronger consistency guarantees than PostgreSQL
CCassandra is optimized for write-heavy, time-series data with linear horizontal scaling — add nodes for more write throughput without resharding
DCassandra natively stores JSON which makes tweet content easier to handle
Question 5 of 5
Twitter search (finding tweets containing "ChatGPT") is implemented using:
AA full table scan of all 500M daily tweets in Cassandra
BPostgreSQL full-text search indexes on the tweet content column
CAn inverted index in Elasticsearch — maps each word to all tweets containing it, updated in near-real-time as tweets are posted
DPre-computed search result pages that are regenerated every 5 minutes