20 deep Q&A pairs covering the full system design spectrum — from 1M RPS architecture to distributed consensus internals
Start with the bottlenecks, not the components. At 1M RPS you need to reason about each layer:
Choose SQL when:
Choose NoSQL when:
Modulo hashing problem: With key % N, adding one server remaps (N-1)/N ≈ 100% of keys. That means a node addition causes a massive cache miss storm or full data reshuffle.
Consistent hashing solution: Both servers and keys are hashed onto a ring (0 to 2³²-1). A key is served by the first server clockwise from its hash position. Adding a server only remaps keys between it and its predecessor — approximately 1/N of all keys.
Virtual nodes: Each physical server owns multiple positions on the ring (e.g., 150 vnodes). This improves load distribution and makes gradual node addition/removal smoother. Cassandra defaults to 256 vnodes per node.
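A minimal sketch of a hash ring with virtual nodes (class name, vnode count, and MD5 are illustrative — production systems typically use a faster hash like murmur):

```python
import bisect
import hashlib

def _hash(key: str) -> int:
    # Map any string onto the 32-bit ring.
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % (2**32)

class ConsistentHashRing:
    def __init__(self, vnodes: int = 150):
        self.vnodes = vnodes
        self.ring = []  # sorted list of (position, server)

    def add_server(self, server: str):
        # Each physical server owns `vnodes` positions on the ring.
        for i in range(self.vnodes):
            bisect.insort(self.ring, (_hash(f"{server}#{i}"), server))

    def get_server(self, key: str) -> str:
        # First server clockwise from the key's position (wrapping around).
        pos = _hash(key)
        idx = bisect.bisect(self.ring, (pos, ""))
        return self.ring[idx % len(self.ring)][1]
```

Adding a fourth server to a three-server ring moves only roughly a quarter of the keys, versus nearly all of them under modulo hashing.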
Common SLIs: success rate = good_requests / total_requests, p99 latency, availability percentage.

Error budgets: a 99.9% SLO allows 0.1% failures ≈ 8.77 hours of downtime per year. If you've burned 80% of your error budget by month 3, you stop feature work and focus on reliability. Google's SRE model: when the error budget is exhausted, new deployments freeze.
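The arithmetic behind those numbers, as a quick check:

```python
def downtime_budget_hours(slo: float, hours_per_year: float = 8766) -> float:
    """Allowed downtime per year for a given availability SLO."""
    return (1 - slo) * hours_per_year

print(round(downtime_budget_hours(0.999), 2))   # 99.9%  -> 8.77 h/year
print(round(downtime_budget_hours(0.9999), 2))  # 99.99% -> 0.88 h/year
```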
The solution is idempotency keys. The client generates a unique key (UUID) and sends it with every request. The server stores the key + response in a database before processing. On retry, the server finds the existing key and returns the stored response instead of executing the operation again.
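A sketch of that flow, with an in-memory dict standing in for the key/response store (class and field names are illustrative; a real implementation needs a unique constraint on the key to close the race between concurrent retries):

```python
import uuid

class PaymentAPI:
    """Toy idempotent request handler; a dict stands in for the database."""
    def __init__(self):
        self.responses = {}  # idempotency_key -> stored response
        self.charges = []    # side effects, to show they happen once

    def charge(self, idempotency_key: str, amount: int) -> dict:
        # Retry with a known key: return the stored response, don't re-process.
        if idempotency_key in self.responses:
            return self.responses[idempotency_key]
        self.charges.append(amount)  # the side effect happens exactly once
        response = {"status": "charged", "amount": amount}
        self.responses[idempotency_key] = response
        return response
```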
Implementation details:
POST /charges with an Idempotency-Key: uuid header.

The CAP theorem says: during a network partition, you must choose between Consistency (all nodes see the same data) and Availability (every request gets a response, possibly stale).
CP Systems (choose consistency over availability):
AP Systems (choose availability over consistency):
Why local rate limiting fails: Each of 50 servers tracks independently. A client can send 50× the rate limit by spreading requests across servers.
Centralized Redis solution: Token bucket per client stored in Redis. Lua script for atomic read-modify-write:
```lua
local tokens = tonumber(redis.call('GET', KEYS[1]) or '0')
if tokens > 0 then
  redis.call('DECR', KEYS[1])
  return 1  -- allow
else
  return 0  -- block
end
```
A sliding window log also works: a Redis sorted set per client (ZADD the request timestamp, ZREMRANGEBYSCORE to drop old entries, ZCARD for the count).

Advanced: two-tier — a local in-memory token bucket for the fast path, synced to Redis every 100ms. Accepts ≤10% rate-limit overage in exchange for 10× lower Redis load.
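The token-bucket logic, sketched in-process (a single Redis Lua script runs the same steps atomically; names and parameters here are illustrative):

```python
import time

class TokenBucket:
    """In-process token bucket with continuous refill."""
    def __init__(self, capacity: int, refill_per_sec: float):
        self.capacity = capacity
        self.refill_per_sec = refill_per_sec
        self.tokens = float(capacity)
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.refill_per_sec)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

The burst allowance comes from `capacity`: an idle client accumulates up to `capacity` tokens and may spend them all at once.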
Fan-out on write: When Lady Gaga tweets, precompute and write her tweet to all 50M follower timelines. Users get instant timeline loads but writes are expensive (50M Redis operations per tweet).
Fan-out on read: Timeline is computed dynamically by fetching tweets from everyone a user follows. Reads are expensive but writes are cheap. Doesn't scale for 500K followers.
Twitter's hybrid approach: fan-out on write for regular users; for accounts above a follower threshold (celebrities), skip the fan-out and merge their recent tweets into the timeline at read time.
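The read-time merge for that hybrid model can be sketched as follows (tweet shape and function names are illustrative):

```python
import heapq

def read_timeline(precomputed: list, celebrity_feeds: list, n: int = 10) -> list:
    """Merge a precomputed (fan-out-on-write) timeline with celebrity tweets
    fetched at read time. Each tweet is (timestamp, text); newest first."""
    merged = heapq.merge(sorted(precomputed, reverse=True),
                         *[sorted(feed, reverse=True) for feed in celebrity_feeds],
                         reverse=True)
    return list(merged)[:n]
```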
Kafka guarantees ordering within a partition. Across partitions, there are no ordering guarantees. The design pattern:
Trade-offs:
For global ordering: Use one partition. Throughput = single consumer throughput. Rarely the right choice at scale.
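The per-key routing that makes partition-level ordering useful: hash the entity's key to pick a stable partition, so all events for one entity land on one partition and stay ordered. A sketch (Kafka's default partitioner uses murmur2; crc32 here is only illustrative):

```python
import zlib

def partition_for(key: str, num_partitions: int) -> int:
    # Stable hash: the same key always maps to the same partition,
    # so events for one entity are consumed in the order they were produced.
    return zlib.crc32(key.encode()) % num_partitions
```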
The problem: Fetching 100 orders, then for each order fetching the user separately = 101 queries. The ORM hides this — it looks like 2 lines of code but generates 101 SQL queries.
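A self-contained illustration with sqlite3 — the N+1 loop versus the batched IN (...) fix (the schema is illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users  (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id INTEGER);
    INSERT INTO users  VALUES (1, 'ada'), (2, 'bob');
    INSERT INTO orders VALUES (10, 1), (11, 2), (12, 1);
""")

# N+1 pattern: one query for orders, then one query per order for its user.
orders = conn.execute("SELECT id, user_id FROM orders").fetchall()
n_plus_1 = [conn.execute("SELECT name FROM users WHERE id = ?", (uid,)).fetchone()[0]
            for _, uid in orders]                       # 1 + N queries

# Batch fix: collect the ids, fetch all users in a single IN (...) query.
ids = {uid for _, uid in orders}
placeholders = ",".join("?" * len(ids))
rows = conn.execute(f"SELECT id, name FROM users WHERE id IN ({placeholders})",
                    tuple(ids)).fetchall()              # 2 queries total
users_by_id = dict(rows)
batched = [users_by_id[uid] for _, uid in orders]
assert batched == n_plus_1   # same result, 2 queries instead of N + 1
```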
Detection in production:
pg_stat_statements: query count per endpoint spikes to 100× expected.

Fix options:
- Eager loading via JOIN: SELECT orders.*, users.* FROM orders JOIN users ON users.id = orders.user_id — one query
- Batch loading: SELECT * FROM users WHERE id IN (...) — two queries total

Speed-of-light limits: US ↔ Europe is ~70ms one-way. You cannot achieve a <100ms transatlantic round trip without edge computing.
Architecture layers:
WAL (Write-Ahead Log): Every database change is written sequentially to the WAL before modifying data files. Sequential writes are 100× faster than random writes. If a crash occurs, PostgreSQL replays the WAL from the last checkpoint to recover.
How replication uses WAL:
CDC via WAL: Debezium reads PostgreSQL WAL via logical decoding and publishes change events to Kafka — enabling event-driven architecture without polling the database.
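A toy version of the write-ahead idea — append to the log and fsync before applying; recovery replays the log (names and record format are illustrative, not PostgreSQL's actual WAL format):

```python
import json
import os
import tempfile

class TinyKV:
    """Toy key-value store: every write hits the WAL before the in-memory
    state; replay() rebuilds that state after a crash."""
    def __init__(self, wal_path: str):
        self.wal_path = wal_path
        self.data = {}

    def put(self, key, value):
        # 1. Sequential append to the WAL, flushed to durable storage.
        with open(self.wal_path, "a") as wal:
            wal.write(json.dumps({"k": key, "v": value}) + "\n")
            wal.flush()
            os.fsync(wal.fileno())
        # 2. Only then modify the "data files" (here, a dict).
        self.data[key] = value

    def replay(self):
        # Crash recovery: rebuild state by replaying the log in order.
        self.data = {}
        with open(self.wal_path) as wal:
            for line in wal:
                rec = json.loads(line)
                self.data[rec["k"]] = rec["v"]
```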
The expand-migrate-contract (online schema change) pattern:
1. Expand: ALTER TABLE orders ADD COLUMN new_col TEXT. This is instantaneous in PostgreSQL (it doesn't rewrite the table).
2. Migrate: backfill in batches — UPDATE orders SET new_col = ... WHERE id BETWEEN X AND Y. Takes hours but doesn't block writes.
3. Contract: enforce constraints and drop the old column (SET NOT NULL in PG 12+ is metadata-only if the column has no nulls).

Backpressure is a flow-control mechanism where downstream systems signal upstream to slow down when they can't keep up. Without it, queues grow without bound and eventually crash the system.
Implementation strategies:
Monitor consumer_group_lag. If lag grows, alert and either add consumers or throttle producers at the source.

Load shedding vs backpressure: backpressure slows the producer (preserves all messages); load shedding drops low-priority messages (preserves throughput). Choose based on whether dropping is acceptable.
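The simplest in-process form of backpressure is a bounded queue whose put blocks (or fails fast) when the consumer falls behind, sketched here:

```python
import queue

# When the buffer is full, put() blocks or raises instead of letting the
# queue grow without bound — the producer is forced to slow down.
buf = queue.Queue(maxsize=100)

def produce(item, timeout: float = 0.01) -> bool:
    try:
        buf.put(item, timeout=timeout)  # waits for space, up to `timeout`
        return True
    except queue.Full:
        return False  # signal upstream: slow down, or shed this item
```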
Step 1 — Diagnose hot partitions: CloudWatch metric ConsumedWriteCapacityUnits per partition. If one partition is maxed out while others are idle = hot partition from bad key design.
Hot partition fixes:
Write sharding: append a random suffix to the partition key (user_id#random(1-100)), then scatter-gather reads across all suffixes.

Step 2 — Check for large items: items near the 400KB size limit consume many read capacity units per read. Move large attributes to S3 and store the S3 key in DynamoDB.
Step 3 — Eventually consistent vs strongly consistent reads: Strongly consistent reads cost 2× capacity and have higher latency. Switch to eventually consistent where possible.
Step 4 — Global Secondary Index: GSI reads from a separate partition — subject to same hot partition issues as base table. Ensure GSI partition key has good cardinality.
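The write-sharding fix for hot partitions (Step 1) can be sketched as follows (shard count and key format are illustrative):

```python
import random

N_SHARDS = 100

def shard_key(user_id: str) -> str:
    # Writes: spread one hot logical key across N physical partition keys.
    return f"{user_id}#{random.randint(1, N_SHARDS)}"

def all_shard_keys(user_id: str) -> list:
    # Reads: scatter-gather — query every shard key and merge the results.
    return [f"{user_id}#{i}" for i in range(1, N_SHARDS + 1)]
```

The trade-off: writes now distribute evenly, but every read fans out to N_SHARDS queries.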
Use REST/gRPC (synchronous) when:
Use message queue (async) when:
Data structure: Trie (prefix tree) in memory. Each node = character, path from root = prefix, leaves = suggestions with weights. O(k) lookup where k = prefix length.
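A minimal trie sketch (production systems precompute and cache the top-k suggestions at each node rather than walking the subtree per request):

```python
class TrieNode:
    __slots__ = ("children", "weight")
    def __init__(self):
        self.children = {}  # char -> TrieNode
        self.weight = 0     # > 0 marks a complete suggestion

class Trie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, word: str, weight: int):
        node = self.root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.weight = weight

    def suggest(self, prefix: str, k: int = 5) -> list:
        # O(len(prefix)) walk to the prefix node, then collect its subtree.
        node = self.root
        for ch in prefix:
            if ch not in node.children:
                return []
            node = node.children[ch]
        out = []
        def dfs(n, path):
            if n.weight:
                out.append((n.weight, prefix + path))
            for ch, child in n.children.items():
                dfs(child, path + ch)
        dfs(node, "")
        return [word for _, word in sorted(out, reverse=True)[:k]]
```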
At 100K QPS architecture:
Ranking signals: Search frequency, recency, personalization (user's search history), trending boost.
Real-time vs batch: For trending topics (breaking news), stream search queries through Kafka → count prefix frequencies in windowed aggregation → update Redis suggestions every 60 seconds.
Pessimistic locking (SELECT FOR UPDATE):
Optimistic locking (version column):
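Optimistic locking with a version column, sketched with sqlite3 (the schema is illustrative): the UPDATE only succeeds if the version is unchanged since the read; zero affected rows means a concurrent writer won and the caller should retry.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts "
             "(id INTEGER PRIMARY KEY, balance INTEGER, version INTEGER)")
conn.execute("INSERT INTO accounts VALUES (1, 100, 0)")

def withdraw(amount: int) -> bool:
    # Read the current state, including the version.
    balance, version = conn.execute(
        "SELECT balance, version FROM accounts WHERE id = 1").fetchone()
    # Write back only if nobody bumped the version in between.
    cur = conn.execute(
        "UPDATE accounts SET balance = ?, version = version + 1 "
        "WHERE id = 1 AND version = ?",
        (balance - amount, version))
    return cur.rowcount == 1  # 0 rows -> lost the race; caller retries
```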
Decision matrix:
What actually breaks at 4 nines:
Monolith advantages (often underrated):
When microservices make sense:
| Property | B-Tree (PostgreSQL/MySQL) | LSM Tree (RocksDB/Cassandra) |
|---|---|---|
| Write performance | Moderate: sequential WAL + random heap write | Excellent: sequential memtable append + SSTable flush |
| Read performance | Excellent: O(log n) via index | Good: may check multiple SSTables (bloom filters help) |
| Space amplification | Low: pages updated in place | Medium-high: multiple versions retained until compaction |
| Write amplification | Medium: WAL + page update | High: data rewritten multiple times during compaction |
| Best for | General OLTP, ad-hoc queries | Write-heavy, time-series, event logs |
| Model | Guarantee | Latency | Example |
|---|---|---|---|
| Linearizability | Reads see latest committed write; appears as single copy | High | etcd, Zookeeper |
| Sequential consistency | All ops appear in some sequential order; each client sees own ops in order | Medium | Raft-based systems |
| Causal consistency | Causally related operations seen in order; concurrent ops can diverge | Low | MongoDB sessions, CockroachDB |
| Eventual consistency | Replicas converge given no new updates; reads may be stale | Lowest | DynamoDB default, Cassandra |
| Algorithm | Burst Handling | Memory | Accuracy | Best For |
|---|---|---|---|---|
| Token Bucket | Allows controlled burst | O(1) | High | API rate limiting with burst allowance |
| Leaky Bucket | Smooths bursts to fixed rate | O(1) | High | Traffic shaping, network QoS |
| Fixed Window | 2× spike at window boundary | O(1) | Low (boundary issue) | Simple use cases, not recommended |
| Sliding Window | Smooth, no boundary spike | O(requests) | Highest | Precise per-user limits |