Interview Prep

20 deep Q&A pairs covering the full system design spectrum — from 1M RPS architecture to distributed consensus internals

Core Design Questions
Q1
Design a system that handles 1 million requests per second. Walk me through your architecture decisions.
Hard

Start with the bottlenecks, not the components. At 1M RPS you need to reason about each layer:

  • DNS + CDN: Anycast routes to nearest PoP. Static assets never hit origin. Dynamic requests get edge-cached with short TTLs.
  • Load Balancers: L4 (HAProxy/AWS NLB) for raw throughput, L7 (nginx/Envoy) for HTTP routing. Use consistent hashing for sticky session requirements.
  • App Tier: Horizontally scale stateless services. Each node handles ~5K-10K RPS. At 1M RPS you need ~100-200 instances depending on request complexity.
  • Cache Layer: Redis cluster with read replicas. Target 90%+ cache hit ratio — that drops DB load to ~100K effective QPS.
  • Database: Sharded PostgreSQL or Cassandra depending on consistency requirements. Connection pooling (PgBouncer) is non-negotiable.
  • Async offloading: Write-heavy operations go to Kafka and are processed asynchronously. Never make the user wait for a DB write on the critical path.
Key insight: At this scale, the read/write ratio matters most. If it's 100:1 read-heavy, cache-first solves it. If it's write-heavy, you need sharding + async pipelines.
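The capacity numbers above can be sanity-checked in a few lines. This is back-of-envelope arithmetic using the assumptions stated in the answer (per-node throughput, cache hit ratio), not measured values:

```python
# Back-of-envelope check of the 1M RPS sizing above.
total_rps = 1_000_000
cache_hit_ratio = 0.90                               # assumed target
per_node_rps_low, per_node_rps_high = 5_000, 10_000  # assumed per-instance throughput

db_qps = round(total_rps * (1 - cache_hit_ratio))    # traffic that misses the cache
min_nodes = total_rps // per_node_rps_high           # best case
max_nodes = total_rps // per_node_rps_low            # worst case

print(db_qps, min_nodes, max_nodes)                  # 100000 100 200
```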
Q2
When would you choose SQL over NoSQL? Give me concrete criteria, not generic trade-offs.
Medium

Choose SQL when:

  • You need ACID transactions across multiple entities (e.g., payment + inventory decrement atomically)
  • Your queries are ad-hoc or unknown at design time (SQL is expressive; NoSQL requires knowing queries upfront)
  • Data is highly relational with complex joins (e.g., reporting, analytics)
  • You're at early stage and write patterns aren't clear yet — SQL is more forgiving of schema evolution
  • Regulatory/compliance requirements demand strong consistency and audit trails

Choose NoSQL when:

  • You have a known, stable access pattern and can denormalize for it (DynamoDB single-table design)
  • You need horizontal write scaling beyond what Postgres vertical scaling provides (~50K writes/s)
  • Schema is truly schema-less or evolving rapidly (document stores)
  • You need sub-millisecond key-value lookups at massive scale (DynamoDB, Redis)
  • Time-series or wide-column data (Cassandra for IoT telemetry and activity feeds)
The real answer: Most systems start SQL. You migrate to NoSQL for specific high-scale components, not the whole system. Uber has both — payments in SQL, driver locations in Cassandra.
Q3
Explain consistent hashing and why it's better than modulo hashing for distributed systems.
Medium

Modulo hashing problem: With key % N, adding one server (N → N+1) remaps roughly N/(N+1) of keys (~91% for 10 servers). That means a node addition causes a massive cache miss storm or full data reshuffle.

Consistent hashing solution: Both servers and keys are hashed onto a ring (0 to 2³²-1). A key is served by the first server clockwise from its hash position. Adding a server only remaps keys between it and its predecessor — approximately 1/N of all keys.

Virtual nodes: Each physical server owns multiple positions on the ring (e.g., 150 vnodes). This improves load distribution and makes gradual node addition/removal smoother. Cassandra defaults to 256 vnodes per node.

  • Adding 1 server to a 10-server ring: modulo remaps 90% of keys, consistent hashing remaps 10%
  • Weighted servers: more powerful servers get more vnodes → proportionally more keys
  • Used by: Cassandra (token ring), Redis Cluster (hash slots), CDN edge selection
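The ring described above fits in a short sketch. This is an illustrative in-memory implementation (md5 as a stand-in hash, 150 vnodes as in the example), not any particular library's API:

```python
import bisect
import hashlib

def _hash(key: str) -> int:
    # position on the 0..2^32-1 ring described above
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % (2**32)

class ConsistentHashRing:
    def __init__(self, vnodes: int = 150):
        self.vnodes = vnodes
        self._keys = []   # sorted vnode positions
        self._map = {}    # vnode position -> physical server

    def add(self, server: str) -> None:
        # each physical server owns `vnodes` positions on the ring
        for i in range(self.vnodes):
            pos = _hash(f"{server}#{i}")
            bisect.insort(self._keys, pos)
            self._map[pos] = server

    def get(self, key: str) -> str:
        # first vnode clockwise from the key's position (wraps around)
        idx = bisect.bisect(self._keys, _hash(key)) % len(self._keys)
        return self._map[self._keys[idx]]
```

Adding an 11th server to a 10-server ring and re-resolving a batch of keys shows only ~1/11 of them move, versus ~91% under modulo hashing.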
Q4
What's the difference between SLO, SLI, and SLA? How do error budgets work?
Medium
  • SLI (Service Level Indicator): The actual measurement — e.g., success rate = good_requests / total_requests, p99 latency, availability percentage.
  • SLO (Service Level Objective): The target — e.g., "p99 latency < 200ms", "availability > 99.9%". Internal engineering target.
  • SLA (Service Level Agreement): Contractual commitment to customers with financial consequences (credits, refunds) if breached. Usually less strict than SLO.

Error budgets: 99.9% SLO = 0.1% allowed failures = 8.77 hours downtime/year. If you've used 80% of your error budget by month 3, you stop feature work and focus on reliability. Google's SRE model: when error budget is exhausted, new deployments freeze.
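The 8.77 hours/year figure falls straight out of the SLO:

```python
# Error-budget arithmetic for a 99.9% SLO over an average year.
slo = 0.999
minutes_per_year = 365.25 * 24 * 60           # average year, incl. leap days
budget_minutes = (1 - slo) * minutes_per_year  # allowed downtime per year
print(round(budget_minutes / 60, 2))           # ~8.77 hours
```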

Key insight: Error budgets align incentives — devs want to ship fast, ops wants stability. The budget gives both teams a shared objective to optimize against, not a blame game.
Q5
How do you prevent double-charging a customer when a payment request is retried?
Hard

The solution is idempotency keys. The client generates a unique key (UUID) and sends it with every request. The server stores the key + response in a database before processing. On retry:

  • If key exists + processing complete → return stored response immediately
  • If key exists + processing in-flight → wait and return when complete (or 409 Conflict)
  • If key not found → process normally and store result atomically

Implementation details:

  • Store idempotency records in the same database transaction as the payment record
  • Key expiry: 24-48 hours (balance between safety and storage costs)
  • Never process if key exists, regardless of request body differences
  • Stripe's implementation: POST /charges with Idempotency-Key: uuid header
Edge case: If the payment service crashed after charging but before storing the idempotency key, you have a double-charge risk. Fix: write idempotency key first in a BEGIN transaction, then charge, then commit — never charge without the key being stored atomically.
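The "key first, then charge, in one transaction" flow can be sketched with SQLite standing in for the payment database. Table and column names are illustrative, and this is a single-process sketch; a real service also needs the in-flight/409 handling described above:

```python
import sqlite3
import uuid

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE idempotency_keys (key TEXT PRIMARY KEY, response TEXT)")
conn.execute("CREATE TABLE charges (id INTEGER PRIMARY KEY, amount_cents INTEGER)")

def charge(idem_key: str, amount_cents: int) -> str:
    try:
        with conn:  # one transaction: key insert and charge commit (or roll back) together
            conn.execute(
                "INSERT INTO idempotency_keys (key, response) VALUES (?, 'pending')",
                (idem_key,))
            cur = conn.execute(
                "INSERT INTO charges (amount_cents) VALUES (?)", (amount_cents,))
            response = f"charge:{cur.lastrowid}"
            conn.execute(
                "UPDATE idempotency_keys SET response = ? WHERE key = ?",
                (response, idem_key))
            return response
    except sqlite3.IntegrityError:
        # key already exists: return the stored response, never charge again
        row = conn.execute(
            "SELECT response FROM idempotency_keys WHERE key = ?",
            (idem_key,)).fetchone()
        return row[0]

key = str(uuid.uuid4())
first = charge(key, 500)
retry = charge(key, 500)   # same response, no second charge row
```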
Q6
Explain CAP theorem with concrete system examples — not the textbook definition.
Medium

CAP says: during a network partition, you must choose between Consistency (all nodes see the same data) or Availability (all requests get a response, possibly stale).

CP Systems (choose consistency over availability):

  • HBase/Zookeeper: If majority of nodes can't communicate, refuse writes. Bank balances must be consistent.
  • MongoDB w/ write concern = majority: Write fails if replica set loses quorum — correct behavior for financial data.
  • etcd/Consul: Leader election requires quorum. Without it, cluster refuses service rather than risk split-brain.

AP Systems (choose availability over consistency):

  • Cassandra: All nodes accept writes even during partition. Reads may return stale data. Eventually consistent. Right for user profiles, sensor data.
  • DynamoDB (default): Eventually consistent reads, always available. Right for shopping carts, leaderboards.
  • DNS: Returns cached (possibly stale) results rather than failing. The web would be unusable if DNS were CP.
The nuance: PACELC extends CAP — even without partitions, there's a latency/consistency trade-off. DynamoDB offers both eventual (low latency) and strongly consistent (higher latency) reads. Design for the actual failure mode, not the theoretical worst case.
Q7
Design a distributed rate limiter for an API gateway with 50 servers and 10M req/min.
Hard

Why local rate limiting fails: Each of 50 servers tracks independently. A client can send 50× the rate limit by spreading requests across servers.

Centralized Redis solution: Token bucket per client stored in Redis. Lua script for atomic read-modify-write:

local tokens = tonumber(redis.call('GET', key)); if tokens and tokens > 0 then redis.call('DECR', key); return 1 else return 0 end

  • Algorithm: Sliding window with Redis sorted sets (ZADD timestamp, ZREMRANGEBYSCORE old entries, ZCARD for count)
  • TTL: Set key expiry = window size to auto-cleanup inactive clients
  • Performance: 10M req/min ≈ 167K req/sec. A single Redis node sustains 100K+ ops/sec, so a small Redis cluster absorbs this comfortably
  • Failure mode: If Redis is down, fail open (allow all) or fail closed (block all) — fail open is usually correct to avoid widespread outage

Advanced: Two-tier — local in-memory token bucket for fast path, sync to Redis every 100ms. Accepts ≤10% rate limit overage in exchange for 10× lower Redis load.
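The sorted-set sliding window above can be sketched in-process; in production the same logic runs as the atomic Redis Lua script, with ZADD/ZREMRANGEBYSCORE/ZCARD replacing the deque operations. Names and limits here are illustrative:

```python
from collections import defaultdict, deque

class SlidingWindowLimiter:
    """In-process sketch of the Redis sorted-set sliding window."""

    def __init__(self, limit: int, window_s: float):
        self.limit = limit
        self.window_s = window_s
        self._hits = defaultdict(deque)  # client_id -> request timestamps in window

    def allow(self, client_id: str, now: float) -> bool:
        hits = self._hits[client_id]
        # drop entries older than the window (ZREMRANGEBYSCORE equivalent)
        while hits and hits[0] <= now - self.window_s:
            hits.popleft()
        if len(hits) >= self.limit:      # ZCARD equivalent
            return False
        hits.append(now)                 # ZADD equivalent
        return True
```

With a limit of 3 per 60s: three requests at t=0,1,2 pass, a fourth at t=3 is blocked, and a request at t=61 passes once old entries expire.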

Q8
What is the fan-out problem and how does Twitter solve it?
Hard

Fan-out on write: When Lady Gaga tweets, precompute and write her tweet to all 50M follower timelines. Users get instant timeline loads but writes are expensive (50M Redis operations per tweet).

Fan-out on read: Timeline is computed dynamically by fetching tweets from everyone a user follows. Reads are expensive but writes are cheap. Doesn't scale for 500K followers.

Twitter's hybrid approach:

  • Users with <1M followers: fan-out on write (precomputed timelines)
  • Celebrities with >1M followers: fan-out on read (injected at query time)
  • Timeline load = precomputed timeline + real-time merge of celebrity tweets
  • Result: O(1) timeline reads for regular users, manageable write load
Key data: Twitter precomputes timelines in Redis (800 tweet limit per user). Timeline service fetches from Redis first, falls back to DB. Celebrity tweets are fetched separately and merged at read time using the follower's local timeline as context.
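The read-time merge of a precomputed timeline with celebrity tweets is a merge of two already-sorted lists. A sketch with illustrative data (timestamps and author names are made up):

```python
import heapq

# (timestamp, author) pairs, both lists sorted newest-first: one from the
# fan-out-on-write cache, one fetched for followed celebrities at read time.
precomputed = [(105, "friend_a"), (100, "friend_b"), (90, "friend_a")]
celebrity = [(103, "celebrity"), (95, "celebrity")]

def load_timeline(precomputed, celebrity, limit=10):
    # O(n) merge of two sorted streams, preserving newest-first order
    merged = heapq.merge(precomputed, celebrity, key=lambda t: t[0], reverse=True)
    return list(merged)[:limit]

timeline = load_timeline(precomputed, celebrity)
# timestamps come out as 105, 103, 100, 95, 90
```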
Q9
How do you guarantee message ordering in Kafka when you have multiple consumers?
Medium

Kafka guarantees ordering within a partition. Across partitions, there are no ordering guarantees. The design pattern:

  • Key-based partitioning: Messages with the same key always go to the same partition (hash(key) % numPartitions). All orders for user_id=123 are in the same partition.
  • One consumer per partition: A consumer group assigns one consumer per partition. No two consumers read the same partition simultaneously.
  • Result: All messages for a given key are processed in order by one consumer

Trade-offs:

  • Parallelism is limited to partition count — adding consumers beyond partition count is wasteful
  • Hot partitions: if one key gets 90% of traffic, one consumer handles 90% of load
  • Rebalancing disrupts ordering briefly — during rebalance, a partition may be reassigned to a new consumer

For global ordering: Use one partition. Throughput = single consumer throughput. Rarely the right choice at scale.
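Key-based partitioning is just a deterministic hash. Kafka's Java client uses murmur2 on the key bytes; md5 below is a stand-in with the property that matters — same key, same partition:

```python
import hashlib

def partition_for(key: str, num_partitions: int = 12) -> int:
    # deterministic: every message with this key lands on the same partition,
    # so one consumer in the group processes them, in order
    digest = hashlib.md5(key.encode()).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions

p1 = partition_for("user-123")
p2 = partition_for("user-123")   # always equal to p1
```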

Q10
What is the N+1 query problem and how do you detect/fix it in production?
Medium

The problem: Fetching 100 orders, then for each order fetching the user separately = 101 queries. The ORM hides this — it looks like 2 lines of code but generates 101 SQL queries.

Detection in production:

  • pg_stat_statements: query count per endpoint spikes to 100× expected
  • APM tools (Datadog, New Relic): trace shows repeated identical queries in a span
  • Slow response times despite fast individual queries (latency adds up)

Fix options:

  • Eager loading / JOIN: SELECT orders.*, users.* FROM orders JOIN users — one query
  • DataLoader pattern (GraphQL): batch all user_id lookups, one SELECT * FROM users WHERE id IN (...)
  • Caching: first user lookup is cached, subsequent lookups are O(1) — N+1 becomes 1+1
  • Denormalization: store user_name on the orders table — no join needed at all
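The DataLoader-style fix reduces N+1 to exactly two queries: fetch the orders, collect the distinct user IDs, then one IN query. A runnable sketch with SQLite and illustrative schema:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id INTEGER);
    INSERT INTO users VALUES (1, 'ada'), (2, 'bob');
    INSERT INTO orders VALUES (10, 1), (11, 2), (12, 1);
""")

def load_orders_with_users():
    orders = conn.execute("SELECT id, user_id FROM orders").fetchall()  # query 1
    user_ids = sorted({uid for _, uid in orders})
    placeholders = ",".join("?" * len(user_ids))
    rows = conn.execute(                                                # query 2
        f"SELECT id, name FROM users WHERE id IN ({placeholders})", user_ids
    ).fetchall()
    users = dict(rows)
    return [(order_id, users[uid]) for order_id, uid in orders]

result = load_orders_with_users()   # 2 queries total, not 1 + len(orders)
```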
Q11
Design a system to serve users globally with <100ms latency. What's your architecture?
Hard

Speed of light limits: US ↔ Europe is ~70ms one-way. You cannot achieve <100ms roundtrip transatlantic without edge computing.

Architecture layers:

  • Static content: CDN (Cloudflare/Fastly) at 300+ PoPs. Assets served from <5ms away.
  • Read-heavy dynamic content: Regional edge caches (5-10 regions globally). Cache-Control with stale-while-revalidate.
  • Edge compute: Cloudflare Workers / Lambda@Edge for auth, A/B testing, personalization at the edge without origin round-trip.
  • Active-active multi-region: Primary regions in US-East, EU-West, AP-Southeast. BGP anycast routes users to closest region.
  • Data replication: Async replication for reads (tolerate stale). Synchronous for writes if consistency is required (pay latency penalty).
  • Conflict resolution: For writes in multiple regions, last-write-wins or CRDT-based merge depending on data type.
Reality check: <100ms for reads is achievable with CDN + regional deployment. For write operations (forms, payments), you must route to a home region — accept 100-200ms write latency or use CRDTs.
Q12
What is a WAL and how does PostgreSQL use it for durability and replication?
Medium

WAL (Write-Ahead Log): Every database change is written sequentially to the WAL before modifying data files. Sequential writes are 100× faster than random writes. If a crash occurs, PostgreSQL replays the WAL from the last checkpoint to recover.

How replication uses WAL:

  • Primary streams WAL records to replicas in real-time (streaming replication)
  • Replica applies WAL records to its own data files, staying in sync
  • Lag = how many WAL bytes the replica is behind the primary
  • Synchronous replication: primary waits for replica to confirm WAL receipt before committing (zero RPO, higher latency)
  • Asynchronous replication: primary commits immediately, replica catches up (low latency, non-zero RPO)

CDC via WAL: Debezium reads PostgreSQL WAL via logical decoding and publishes change events to Kafka — enabling event-driven architecture without polling the database.
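The core WAL invariant — log the change durably before applying it, replay the log on restart — fits in a toy key-value store. This is a teaching sketch, not how PostgreSQL's WAL is formatted:

```python
import json
import os
import tempfile

class TinyKV:
    """Toy WAL: every change is appended and fsynced before touching state."""

    def __init__(self, wal_path: str):
        self.wal_path = wal_path
        self.data = {}
        self._replay()   # crash recovery: re-apply the log in order

    def _replay(self):
        if os.path.exists(self.wal_path):
            with open(self.wal_path) as f:
                for line in f:
                    rec = json.loads(line)
                    self.data[rec["k"]] = rec["v"]

    def put(self, key, value):
        with open(self.wal_path, "a") as f:
            f.write(json.dumps({"k": key, "v": value}) + "\n")
            f.flush()
            os.fsync(f.fileno())   # durable on disk before we acknowledge
        self.data[key] = value     # only then mutate the "data files"

wal = os.path.join(tempfile.mkdtemp(), "wal.log")
db = TinyKV(wal)
db.put("balance", 42)
recovered = TinyKV(wal)   # simulate a restart after a crash: replay recovers state
```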

Q13
How do you do a zero-downtime schema migration on a 500 million row table?
Hard

The expand-migrate-contract (online schema change) pattern:

  • Step 1 — Expand: Add the new column as nullable with no constraint. ALTER TABLE ADD COLUMN new_col TEXT. This is instantaneous in PostgreSQL (doesn't rewrite table).
  • Step 2 — Backfill: UPDATE rows in batches of 1-10K with rate limiting. UPDATE orders SET new_col = ... WHERE id BETWEEN X AND Y. Takes hours but doesn't block.
  • Step 3 — Deploy app: Deploy new code that writes to both old and new columns (dual-write). Reads from new column if populated, falls back to old.
  • Step 4 — Validate: Verify backfill is complete. Add the NOT NULL constraint (in PG 12+, SET NOT NULL skips the full-table scan if a validated CHECK (new_col IS NOT NULL) constraint already proves there are no nulls).
  • Step 5 — Contract: Deploy code that only reads/writes new column. Drop old column in a subsequent migration.
Tools: pt-online-schema-change (MySQL), pg_repack (PostgreSQL), gh-ost (GitHub's online schema change tool). Always test on a replica first with production data volume.
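Step 2's batched backfill is worth seeing concretely. A sketch with SQLite and an illustrative schema; the key points are the small per-batch transactions and the idempotent WHERE clause (a production run would also sleep between batches to rate-limit):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, old_col TEXT, new_col TEXT)")
conn.executemany("INSERT INTO orders (id, old_col) VALUES (?, ?)",
                 [(i, f"v{i}") for i in range(1, 10_001)])
conn.commit()

BATCH = 1_000   # short transactions keep lock hold times and replica lag low

max_id = conn.execute("SELECT MAX(id) FROM orders").fetchone()[0]
for start in range(1, max_id + 1, BATCH):
    with conn:   # one small transaction per batch
        conn.execute(
            "UPDATE orders SET new_col = old_col "
            "WHERE id BETWEEN ? AND ? AND new_col IS NULL",   # safe to re-run
            (start, start + BATCH - 1),
        )

remaining = conn.execute(
    "SELECT COUNT(*) FROM orders WHERE new_col IS NULL").fetchone()[0]
# remaining == 0 once the backfill completes
```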
Q14
What is backpressure and how do you implement it in a distributed pipeline?
Medium

Backpressure is a flow control mechanism where downstream systems signal upstream to slow down when they can't keep up. Without it, queues grow unboundedly and eventually crash the system.

Implementation strategies:

  • Kafka consumer lag: Monitor consumer_group_lag. If lag grows, alert and add consumers or throttle producers at source.
  • HTTP 429 Too Many Requests: Server returns 429 when overloaded. Client backs off with exponential jitter.
  • Bounded queues: In-process queues have a max size. When full, producer blocks (backpressure) or drops (load shedding).
  • TCP flow control: Receiver window size limits data in flight — built into TCP.
  • gRPC flow control: HTTP/2 stream-level flow control prevents one slow consumer from blocking others.

Load shedding vs backpressure: Backpressure slows the producer (preserves all messages). Load shedding drops low-priority messages (preserves throughput). Choose based on whether dropping is acceptable.
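The bounded-queue strategy is visible in-process with Python's `queue.Queue`: `put` on a full queue blocks the producer thread, which is backpressure in miniature (the load-shedding variant would use `put_nowait` and drop on `queue.Full`):

```python
import queue
import threading

q = queue.Queue(maxsize=100)   # bounded: a full queue blocks the producer
delivered = []

def producer(n):
    for i in range(n):
        q.put(i)               # blocks while full -> producer is throttled

def consumer(n):
    for _ in range(n):
        delivered.append(q.get())
        q.task_done()

t_prod = threading.Thread(target=producer, args=(1000,))
t_cons = threading.Thread(target=consumer, args=(1000,))
t_prod.start(); t_cons.start()
t_prod.join(); t_cons.join()
# all 1000 items arrive in order: the producer slowed down, nothing was dropped
```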

Q15
DynamoDB is returning high latency at p99. What do you check and how do you fix it?
Hard

Step 1 — Diagnose hot partitions: CloudWatch metric ConsumedWriteCapacityUnits per partition. If one partition is maxed out while others are idle = hot partition from bad key design.

Hot partition fixes:

  • Add high-cardinality suffix to partition key (e.g., user_id#random(1-100)) then scatter-gather reads
  • DAX (DynamoDB Accelerator): in-memory cache, reduces read latency from ms to μs
  • Shard the hot key: store in N partition keys, read all N and merge

Step 2 — Check for large items: DynamoDB caps items at 400KB, and read capacity is consumed in 4KB increments, so items near the limit burn many RCUs per read. Offload large attributes to S3 and store the S3 key in DynamoDB.

Step 3 — Eventually consistent vs strongly consistent reads: Strongly consistent reads cost 2× capacity and have higher latency. Switch to eventually consistent where possible.

Step 4 — Global Secondary Index: GSI reads from a separate partition — subject to same hot partition issues as base table. Ensure GSI partition key has good cardinality.
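The write-sharding fix from Step 1 is a small amount of key manipulation. Shard count and key format here are illustrative:

```python
import random

SHARDS = 10   # illustrative; size to the hot key's write rate

def write_key(logical_key: str) -> str:
    # writes scatter across SHARDS physical partition keys
    return f"{logical_key}#{random.randrange(SHARDS)}"

def read_keys(logical_key: str) -> list:
    # reads must scatter-gather every shard and merge client-side
    return [f"{logical_key}#{i}" for i in range(SHARDS)]

# e.g. write_key("user-123") -> "user-123#7"; read_keys("user-123") hits all 10
```

The trade: write throughput multiplies by the shard count, but every read now costs N fetches plus a merge, so only shard keys that are actually hot.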

Q16
When should you use a message queue vs a direct REST call between services?
Medium

Use REST/gRPC (synchronous) when:

  • Caller needs the response to continue (user waiting for API response)
  • Operation is idempotent and simple retry is acceptable
  • Low latency is critical (<50ms end-to-end)
  • Operations are not CPU/time intensive on the receiver

Use message queue (async) when:

  • Operation can be processed asynchronously (email confirmation, audit log, webhooks)
  • Receiver is slow or unreliable — decouple so receiver failures don't affect caller
  • Fan-out: one event needs to trigger multiple downstream actions
  • Workload spikes: queue absorbs bursts, consumers process at steady rate
  • You need durable delivery guarantees (message can't be lost)
Rule of thumb: If the user is waiting → synchronous. If not → async. "Fire and forget" + at-least-once delivery → queue. Idempotent receivers are non-negotiable for queues (messages may be delivered multiple times).
Q17
Design an autocomplete system for a search box that handles 100K queries/second.
Hard

Data structure: Trie (prefix tree) in memory. Each node = character, path from root = prefix, leaves = suggestions with weights. O(k) lookup where k = prefix length.

At 100K QPS architecture:

  • Precompute top-K suggestions per prefix offline (nightly batch from search logs)
  • Store precomputed results in Redis: key = "prefix:que", value = JSON array of top 10 suggestions
  • 99% of queries hit Redis cache (O(1) lookup); CDN edge caches popular prefixes
  • Cache hit ratio optimizations: cache all prefixes up to 6 chars (low cardinality, high hit ratio)

Ranking signals: Search frequency, recency, personalization (user's search history), trending boost.

Real-time vs batch: For trending topics (breaking news), stream search queries through Kafka → count prefix frequencies in windowed aggregation → update Redis suggestions every 60 seconds.
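The trie from the first paragraph can be sketched directly; this minimal version collects all completions under a prefix and sorts by weight, whereas the production system would precompute top-K per node offline as described:

```python
class TrieNode:
    __slots__ = ("children", "weight", "term")
    def __init__(self):
        self.children = {}
        self.weight = 0
        self.term = None   # set only on nodes that end a complete term

class Autocomplete:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, term: str, weight: int):
        node = self.root
        for ch in term:
            node = node.children.setdefault(ch, TrieNode())
        node.term, node.weight = term, weight

    def top_k(self, prefix: str, k: int = 10):
        node = self.root
        for ch in prefix:              # O(len(prefix)) descent
            if ch not in node.children:
                return []
            node = node.children[ch]
        results = []
        stack = [node]
        while stack:                   # collect every completion under the prefix
            n = stack.pop()
            if n.term is not None:
                results.append((n.weight, n.term))
            stack.extend(n.children.values())
        return [t for _, t in sorted(results, reverse=True)[:k]]

ac = Autocomplete()
for term, w in [("query", 100), ("queue", 50), ("quest", 75)]:
    ac.insert(term, w)
# ac.top_k("que", 2) -> ["query", "quest"]
```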

Q18
Compare optimistic vs pessimistic locking. When does each strategy break down?
Medium

Pessimistic locking (SELECT FOR UPDATE):

  • Lock row before reading, hold lock until transaction commits
  • Safe for high-contention data (inventory counts, bank balances)
  • Breaks down: locks held too long cause queue buildup; distributed systems can't hold locks across network calls; deadlock risk with multiple locks

Optimistic locking (version column):

  • Read without lock, include version in UPDATE WHERE clause. If version changed = conflict, retry.
  • High throughput for low-contention data
  • Breaks down: high contention = high retry rate (livelock); starvation under heavy write load; not suitable when you can't afford to retry

Decision matrix:

  • Bank transfer (high contention, can't retry blindly) → pessimistic
  • Wiki page edit (low contention, retry is fine) → optimistic + show conflict to user
  • Shopping cart add item (concurrent users, idempotent) → optimistic with CRDT merge
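Optimistic locking's version-column check is one conditional UPDATE. A sketch with SQLite and an illustrative schema:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pages (id INTEGER PRIMARY KEY, body TEXT, version INTEGER)")
conn.execute("INSERT INTO pages VALUES (1, 'original', 0)")
conn.commit()

def save(page_id: int, new_body: str, expected_version: int) -> bool:
    cur = conn.execute(
        "UPDATE pages SET body = ?, version = version + 1 "
        "WHERE id = ? AND version = ?",   # matches nothing if the version moved
        (new_body, page_id, expected_version),
    )
    conn.commit()
    return cur.rowcount == 1   # 0 rows = someone else won; caller must re-read and retry

# Two editors both read version 0; only the first save succeeds.
ok_first = save(1, "alice's edit", 0)    # True
ok_second = save(1, "bob's edit", 0)     # False: stale version -> conflict
```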
Q19
How do you achieve 99.99% availability? What does each additional 9 actually require?
Hard
  • 99% (2 nines): ~87 hours downtime/year. Single server with monitoring. Junior ops can achieve this.
  • 99.9% (3 nines): ~8.7 hours/year. Active-passive failover, health checks, automated restart. Standard for most production systems.
  • 99.99% (4 nines): ~52 minutes/year. Active-active across 2+ AZs, automated failover <30s, zero-downtime deploys, comprehensive runbooks. Requires SRE discipline.
  • 99.999% (5 nines): ~5 minutes/year. Multi-region active-active, automated traffic shifting, chaos engineering, sub-second failover. Rare, expensive, requires dedicated reliability engineers.

What actually breaks at 4 nines:

  • Deployments: rolling deploys with health checks, blue-green or canary
  • Dependencies: your SLA can't exceed your dependencies' SLA. Three 99.9% dependencies = 99.7% at best.
  • Database failover: automated promotion with quorum, not manual runbooks
  • Human error: biggest cause of outages — circuit breakers, feature flags, gradual rollout prevent mistakes from being immediate disasters
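The dependency math above is a straight product: serial dependencies multiply their availabilities.

```python
# Composite availability of serial dependencies is the product of each one's.
deps = [0.999, 0.999, 0.999]   # three 3-nines dependencies in the request path
composite = 1.0
for a in deps:
    composite *= a
print(round(composite, 4))      # 0.997: already below a 99.9% target
```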
Q20
Microservices vs monolith: what's the actual trade-off and when do you break apart a monolith?
Medium

Monolith advantages (often underrated):

  • Single deploy, no network latency between components, simple debugging (one log stream)
  • Refactoring is cheap — rename a function, compiler catches all callers
  • Transactions are free — one DB, one transaction boundary
  • Right for: small teams, early-stage products, unclear domain boundaries

When microservices make sense:

  • Independent scaling requirements: payments needs 3 nines, recommendations can tolerate more downtime
  • Independent deployment cadence: different teams need to deploy without coordinating
  • Technology isolation: ML inference service needs Python + GPU; API server is Go
  • Org scale: Conway's Law — 10 teams deploying to one monolith = deployment coordination nightmare
Real advice: Start with a well-structured monolith with clear domain boundaries. Extract services when you hit concrete scaling or team coordination problems — not because microservices are the "right architecture." Premature decomposition is one of the costliest architectural mistakes.
Key Trade-off Tables

Database Storage Engines

Property | B-Tree (PostgreSQL/MySQL) | LSM Tree (RocksDB/Cassandra)
Write performance | Moderate: sequential WAL + random heap write | Excellent: sequential memtable + SSTable flush
Read performance | Excellent: O(log n) via index | Good: may check multiple SSTables + bloom filter
Space amplification | Low: pages updated in place | Medium-high: multiple versions until compaction
Write amplification | Medium: WAL + page update | High: data written multiple times during compaction
Best for | General OLTP, ad-hoc queries | Write-heavy, time-series, event logs

Consistency Models

Model | Guarantee | Latency | Example
Linearizability | Reads see latest committed write; appears as single copy | High | etcd, Zookeeper
Sequential consistency | All ops appear in some sequential order; each client sees own ops in order | Medium | Raft-based systems
Causal consistency | Causally related operations seen in order; concurrent ops can diverge | Low | MongoDB sessions, CockroachDB
Eventual consistency | Replicas converge given no new updates; reads may be stale | Lowest | DynamoDB default, Cassandra

Rate Limiting Algorithms

Algorithm | Burst Handling | Memory | Accuracy | Best For
Token Bucket | Allows controlled burst | O(1) | High | API rate limiting with burst allowance
Leaky Bucket | Smooths bursts to fixed rate | O(1) | High | Traffic shaping, network QoS
Fixed Window | 2× spike at window boundary | O(1) | Low (boundary issue) | Simple use cases, not recommended
Sliding Window | Smooth, no boundary spike | O(requests) | Highest | Precise per-user limits