Technology Decision Guide
The most important skill in system design isn't knowing the technology; it's knowing when to choose it and why. This guide covers the major technology decision points you'll face in interviews and in production.
Database Selection
The database choice is the most consequential architectural decision. Getting it wrong means a painful migration under production load. Use this framework to choose correctly the first time.
| Database | Data Model | Best For | Strengths | Weaknesses | Real Example |
|---|---|---|---|---|---|
| PostgreSQL | Relational (rows + joins) | ACID transactions, complex queries, financial data | ACID, joins, mature | Vertical scaling limits; sharding is complex | Stripe ledger, GitHub issues, Airbnb bookings |
| MySQL | Relational | Web apps, read replicas, semi-structured data | Fast reads, wide support | Weaker MVCC than Postgres; JSON support added late | Facebook, Twitter user data (early) |
| Cassandra | Wide-column (partition key) | Time-series, write-heavy, globally distributed | Linear scale, multi-region, no SPOF | No joins; eventual consistency; bad for ad-hoc queries | Netflix watch history, Discord messages, Uber location |
| DynamoDB | Key-value + document | Serverless, single-table design, predictable access patterns | Managed, auto-scale, single-digit-ms latency | Expensive at high throughput; query patterns locked in at design time | Amazon cart, Snapchat, ad tech |
| MongoDB | Document (BSON/JSON) | Flexible schema, content management, prototyping | Flexible schema, aggregation pipeline | Joins via $lookup (slow at scale); no multi-document ACID transactions before 4.0 | Craigslist, eBay catalog, content platforms |
| Redis | Key-value + data structures | Cache, sessions, leaderboards, pub/sub, rate limiting | ~100K ops/s, in-memory, atomic ops | Data must fit in RAM; persistence is secondary | Twitter timeline cache, Slack presence, Stripe idempotency |
| Elasticsearch | Inverted index (document) | Full-text search, log analytics, faceted search | Full-text, aggregations, near real-time | Not ACID; expensive; index updates have lag; over-fetching common | GitHub code search, Slack search, Airbnb listing search |
| ClickHouse | Column-oriented (OLAP) | Analytics, reporting, time-series aggregations | Billion-row queries, columnar compression | Not suitable for OLTP (point reads/updates are slow) | Cloudflare analytics, Uber data platform, Contentsquare |
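One way to make the table actionable is to encode its "Best For" column as a small lookup. The sketch below is illustrative only: `Workload`, its flags, and `shortlist` are hypothetical names rather than part of any library, and the mapping simply mirrors the rows above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Workload:
    """Hypothetical workload flags; each maps to one 'Best For' cell above."""
    acid_transactions: bool = False
    read_heavy_web_app: bool = False
    write_heavy_multi_region: bool = False
    predictable_key_access: bool = False
    flexible_schema: bool = False
    hot_cache_or_sessions: bool = False
    full_text_search: bool = False
    analytics_scans: bool = False


# Order follows the table; each entry is (flag name, candidate database).
_RULES = [
    ("acid_transactions", "PostgreSQL"),
    ("read_heavy_web_app", "MySQL"),
    ("write_heavy_multi_region", "Cassandra"),
    ("predictable_key_access", "DynamoDB"),
    ("flexible_schema", "MongoDB"),
    ("hot_cache_or_sessions", "Redis"),
    ("full_text_search", "Elasticsearch"),
    ("analytics_scans", "ClickHouse"),
]


def shortlist(workload: Workload) -> list[str]:
    """Return every database whose 'Best For' flag is set on the workload."""
    return [db for flag, db in _RULES if getattr(workload, flag)]
```

A workload with several flags set returns several candidates, so the weaknesses column still has to be checked by hand for each one.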
Caching Strategy Selection
The right cache in the right place can reduce database load by up to 99% and cut response latency roughly tenfold. The wrong cache strategy causes stale data, stampedes, and hard-to-trace bugs.
| Technology | Type | Best For | Eviction | Persistence | Latency |
|---|---|---|---|---|---|
| Redis | In-memory, remote | Shared cache across servers, sessions, pub/sub, leaderboards | LRU, LFU, TTL | RDB snapshots + AOF log | <1ms |
| Memcached | In-memory, remote | Simple string/binary cache, horizontal scaling | LRU only | None (ephemeral) | <1ms |
| CDN (Cloudflare, CloudFront) | Edge cache, HTTP | Static assets, images, public API responses | TTL + Cache-Control | Edge PoP distributed | ~5-50ms |
| In-Process (L1) | Local memory (JVM heap, Python dict) | Hottest data, config, feature flags | LRU (Caffeine, Guava) | None (process-local) | <0.1ms |
| Varnish / Nginx proxy cache | HTTP reverse proxy cache | Full-page caching, API response caching | TTL + PURGE | Disk or memory | <5ms |
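A stampede happens when many requests miss the same key at once and all fall through to the database. Below is a minimal cache-aside sketch with TTL expiry and a per-key lock, so only one caller reloads an expired key. `CacheAside`, `load`, and the injectable `clock` are hypothetical names chosen for illustration, and an in-process dict stands in for a remote cache such as Redis or Memcached.

```python
import threading
import time
from typing import Any, Callable, Optional


class CacheAside:
    """Cache-aside with TTL and per-key locking (illustrative, in-process only)."""

    def __init__(self, ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl_seconds
        self._clock = clock
        self._data: dict[str, tuple[float, Any]] = {}   # key -> (expires_at, value)
        self._locks: dict[str, threading.Lock] = {}
        self._guard = threading.Lock()                  # protects self._locks

    def _fresh(self, key: str) -> Optional[tuple[float, Any]]:
        entry = self._data.get(key)
        if entry is not None and entry[0] > self._clock():
            return entry
        return None

    def get(self, key: str, load: Callable[[str], Any]) -> Any:
        entry = self._fresh(key)
        if entry is not None:
            return entry[1]                             # cache hit
        with self._guard:
            lock = self._locks.setdefault(key, threading.Lock())
        with lock:
            # Re-check: another thread may have reloaded while we waited.
            entry = self._fresh(key)
            if entry is not None:
                return entry[1]
            value = load(key)                           # single trip to the source
            self._data[key] = (self._clock() + self._ttl, value)
            return value
```

The TTL bounds how stale a value can get, and the second freshness check inside the lock is what keeps concurrent misses from each calling `load`.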
Cache-Control: public, max-age=3600. Never cache auth tokens or personal data.
Message Queue Selection
Every distributed system needs async communication. The choice of message queue determines your throughput ceiling, delivery guarantees, and operational complexity.
| Technology | Model | Throughput | Delivery | Retention | Best For |
|---|---|---|---|---|---|
| Apache Kafka | Distributed log (pub/sub) | 1M+ msg/sec per cluster | At-least-once (exactly-once with transactions) | Days to forever (log compaction) | Event streaming, audit logs, CDC, fan-out at scale |
| RabbitMQ | Message broker (AMQP) | 50-100K msg/sec | At-least-once (with acks and publisher confirms); at-most-once without acks | Until consumed (queue-based) | Task queues, RPC, complex routing (fanout, topic, direct) |
| AWS SQS | Managed queue | Nearly unlimited (standard); lower, quota-limited (FIFO, ordered) | At-least-once (standard); exactly-once processing within a deduplication window (FIFO) | Up to 14 days | Decoupling AWS services, simple task queues, Lambda triggers |
| Redis Pub/Sub | In-memory pub/sub | ~100K msg/sec | At-most-once (fire and forget, no persistence) | None (ephemeral) | Real-time notifications, presence updates, invalidation signals |
| Redis Streams | Append-only log (in-memory) | ~100K msg/sec | At-least-once (consumer groups) | Configurable (MAXLEN) | Small-scale event streaming, activity feeds, audit trail |
| Google Pub/Sub | Managed pub/sub | Millions msg/sec | At-least-once | 7 days default (configurable) | GCP-native event streaming, global fan-out |
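Most rows in the Delivery column say at-least-once, which means a consumer can see the same message twice and has to tolerate it. A common way to handle this is an idempotent consumer that remembers processed message IDs. The sketch below assumes each message carries a unique ID; `IdempotentConsumer` and `handler` are hypothetical names, and the in-memory set stands in for a durable store that a real deployment would need.

```python
from typing import Any, Callable, Hashable


class IdempotentConsumer:
    """Skips messages whose ID has already been processed successfully."""

    def __init__(self, handler: Callable[[Any], None]) -> None:
        self._handler = handler
        # Assumption: in-memory only. A real system would persist seen IDs
        # (for example with a TTL) so that restarts do not forget them.
        self._seen: set[Hashable] = set()

    def handle(self, message_id: Hashable, payload: Any) -> bool:
        """Process the payload once per ID; return False for a duplicate."""
        if message_id in self._seen:
            return False
        self._handler(payload)
        # Mark as seen only after the handler succeeds, so a failed attempt
        # is retried when the broker redelivers the message.
        self._seen.add(message_id)
        return True
```

If the handler raises, the ID is not recorded and the exception propagates, which leaves the broker free to redeliver.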
API Protocol Selection
The protocol you choose determines latency, streaming capability, client compatibility, and development friction. No single protocol wins in every context.
| Protocol | Transport | Bi-Directional? | Best For | Weaknesses |
|---|---|---|---|---|
| REST (HTTP/1.1) | HTTP, JSON | No (request/response) | Public APIs, browser clients, simple CRUD, ubiquitous compatibility | Over-fetching/under-fetching; multiple round-trips; no streaming |
| GraphQL | HTTP, JSON | Subscriptions via WS | Mobile APIs (bandwidth-sensitive), multi-entity queries, rapidly evolving schema | Complex caching; N+1 query problem; over-engineering for simple APIs |
| gRPC | HTTP/2, Protocol Buffers | Yes (bidirectional streaming) | Internal microservice-to-microservice, high throughput, polyglot services | Browser support limited; binary protocol (harder to debug); requires codegen |
| WebSocket | TCP (upgraded HTTP) | Yes (full duplex) | Real-time messaging (chat, gaming, trading), live feeds, presence | Stateful connections (load balancer routing); hard to scale; no request/response semantics |
| Server-Sent Events (SSE) | HTTP/1.1 | Server → Client only | Live feeds, notifications, progress updates | One-direction only; limited browser connections; no binary support |
| Long Polling | HTTP | Simulated | Fallback when WebSocket unavailable; simple notification systems | High server resource usage; latency up to poll interval; inefficient |
Storage Type Selection
Databases store structured records. But images, videos, logs, and backups need different storage. Choosing the wrong storage type means either wasting money or hitting hard limits.
| Type | Examples | Access Pattern | Cost | Best For |
|---|---|---|---|---|
| Object Storage | S3, GCS, Azure Blob | HTTP GET/PUT by key | $0.02-0.03/GB/month | Images, videos, backups, logs, ML datasets, static assets |
| Block Storage | AWS EBS, GCP PD | Random read/write (disk-like) | $0.10-0.15/GB/month | Database disks, OS volumes, low-latency random I/O |
| File Storage (NFS) | AWS EFS, GCP Filestore | POSIX file system, shared mount | $0.30/GB/month | Shared file system (CMS, shared config), legacy apps needing POSIX |
| CDN Edge Cache | Cloudflare, CloudFront, Fastly | HTTP GET (cached at edge PoP) | $0.005-0.085/GB egress | Static assets, video segments (HLS), public API responses |
| In-Memory Storage | Redis, Memcached | Key-based, <1ms | $0.40-0.80/GB/month | Hot cache, sessions, ephemeral data (not primary storage) |
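The Cost column is easiest to compare by pricing one data set across all types. The sketch below does this arithmetic using the midpoint of each range in the table as an assumed price. CDN edge cache is left out because it is billed per GB of egress rather than per GB stored. `PRICE_PER_GB_MONTH`, `monthly_cost`, and `cost_table` are hypothetical names.

```python
# Assumed prices in USD per GB per month: midpoints of the table's ranges.
PRICE_PER_GB_MONTH = {
    "object": 0.025,     # $0.02-0.03
    "block": 0.125,      # $0.10-0.15
    "file": 0.30,        # $0.30
    "in_memory": 0.60,   # $0.40-0.80
}


def monthly_cost(storage_type: str, gigabytes: float) -> float:
    """Monthly cost in USD for storing `gigabytes` on the given storage type."""
    return round(PRICE_PER_GB_MONTH[storage_type] * gigabytes, 2)


def cost_table(gigabytes: float) -> dict[str, float]:
    """Monthly cost of the same data set on every storage type."""
    return {name: monthly_cost(name, gigabytes) for name in PRICE_PER_GB_MONTH}
```

For 10 TB (10,000 GB), these assumed prices put object storage at about $250 per month and in-memory storage at about $6,000, a 24× gap.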
Architecture Pattern Selection
These are the most expensive decisions to reverse. Monolith vs. microservices, sync vs. async, stateful vs. stateless: these choices shape your team structure and deployment model for years.
| Pattern | Complexity | Scale | Team Size | Use When | Avoid When |
|---|---|---|---|---|---|
| Monolith | Low | Up to ~100K DAU | 1-20 engineers | Early stage, small team, rapid iteration, unclear domain | Different services need different scaling, independent deployment |
| Modular Monolith | Medium | Up to ~1M DAU | 20-100 engineers | Clear domain boundaries, want monolith simplicity with structure | Services truly need independent scaling or different tech stacks |
| Microservices | High | Unlimited (per service) | 100+ engineers | Different scaling needs per service, large team, clear domain boundaries | Small team (<20), unclear domain, no DevOps/SRE capacity |
| Event-Driven (CQRS/ES) | Very High | Very high read throughput | 50+ engineers | Read/write ratio >100:1, audit log required, temporal queries | Simple CRUD, teams unfamiliar with eventual consistency |
| Serverless (FaaS) | Medium | Spiky/unpredictable | Any | Bursty workloads, event-triggered tasks, minimal ops overhead desired | Low-latency (cold starts), stateful, sustained high throughput (expensive) |
Numbers Every Engineer Should Know
These numbers are your foundation for back-of-envelope estimation. Memorize them. Interviewers expect you to justify architecture decisions with quantitative reasoning.
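A back-of-envelope estimate usually starts from daily active users and works down to requests per second and storage per day. The sketch below shows that arithmetic. The function names and the default peak factor of 3 are assumptions for illustration, not figures from this guide.

```python
SECONDS_PER_DAY = 86_400


def estimate_qps(daily_active_users: int,
                 requests_per_user_per_day: float,
                 peak_factor: float = 3.0) -> tuple[float, float]:
    """Return (average QPS, peak QPS) for the given daily traffic.

    peak_factor is an assumed multiplier for peak over average load.
    """
    average = daily_active_users * requests_per_user_per_day / SECONDS_PER_DAY
    return average, average * peak_factor


def estimate_storage_gb_per_day(writes_per_day: int, bytes_per_write: int) -> float:
    """Return new data written per day in decimal gigabytes (1 GB = 1e9 bytes)."""
    return writes_per_day * bytes_per_write / 1e9
```

With 10 million DAU and 20 requests per user per day (both made-up inputs), this gives roughly 2,300 average QPS and about 6,900 QPS at a 3× peak.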
The 12 Golden Rules of Technology Selection
These rules encode hard-won industry knowledge. They are not absolute laws — but violating them requires an excellent justification.