⚙️ Reference Guide

Technology Decision Guide
The most important skill in system design isn't knowing the technology — it's knowing when to choose it and why. This guide covers every major technology decision point you'll face in interviews and in production.

Database Selection

The database choice is the most consequential architectural decision. Getting it wrong means a painful migration under production load. Use this framework to choose correctly the first time.

| Database | Data Model | Best For | Strengths | Weaknesses | Real Examples |
|---|---|---|---|---|---|
| PostgreSQL | Relational (rows + joins) | ACID transactions, complex queries, financial data | ACID, joins, mature | Vertical scaling limits; sharding is complex | Stripe ledger, GitHub issues, Airbnb bookings |
| MySQL | Relational | Web apps, read replicas, semi-structured data | Fast reads, wide support | Weaker MVCC than Postgres; JSON support added late | Facebook, Twitter user data (early) |
| Cassandra | Wide-column (partition key) | Time-series, write-heavy, globally distributed | Linear scale, multi-region, no SPOF | No joins; eventual consistency; poor for ad-hoc queries | Netflix watch history, Discord messages, Uber location |
| DynamoDB | Key-value + document | Serverless, single-table design, predictable access patterns | Managed, auto-scale, single-digit-ms latency | Expensive at high throughput; query patterns locked in at design time | Amazon cart, Snapchat, ad tech |
| MongoDB | Document (BSON/JSON) | Flexible schema, content management, prototyping | Flexible schema, aggregation pipeline | Joins via $lookup (slow at scale); multi-document ACID transactions only since 4.0 | Craigslist, eBay catalog, content platforms |
| Redis | Key-value + data structures | Cache, sessions, leaderboards, pub/sub, rate limiting | ~100K ops/s, in-memory, atomic ops | Data must fit in RAM; persistence is secondary | Twitter timeline cache, Slack presence, Stripe idempotency |
| Elasticsearch | Inverted index (document) | Full-text search, log analytics, faceted search | Full-text, aggregations, near real-time | Not ACID; expensive; index refresh lag; over-fetching common | GitHub code search, Slack search, Airbnb listing search |
| ClickHouse | Column-oriented (OLAP) | Analytics, reporting, time-series aggregations | Billion-row queries, columnar compression | Not suitable for OLTP (point reads/updates are slow) | Cloudflare analytics, Uber data platform, Contentsquare |
Do you need joins between entities?
SQL (PostgreSQL/MySQL). Joins at scale require careful schema design (denormalization, read replicas). NoSQL systems handle this with application-level joins or denormalized data.
Is write throughput > 10K/s or globally distributed?
Cassandra or DynamoDB. Designed for linear write scaling. Cassandra: multi-region active-active. DynamoDB: fully managed, AWS-native. Both sacrifice strong consistency by default.
Is data access by a single key (user_id, session_id)?
Redis or DynamoDB. Key-value access at sub-millisecond latency. Redis for in-memory + data structures (sorted sets, lists). DynamoDB for durable, auto-scaling key-value.
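The leaderboard case maps directly onto Redis sorted sets. A minimal sketch, using a plain dict in place of a real Redis connection; `zadd`, `zincrby`, and `top` here mirror the semantics of ZADD, ZINCRBY, and ZREVRANGE:

```python
# In-memory stand-in for a Redis sorted-set leaderboard. In production
# these would be redis-py calls against a shared Redis instance; the
# score-ordered semantics shown here are the same.

class Leaderboard:
    def __init__(self):
        self.scores = {}          # member -> score, like one Redis sorted set

    def zadd(self, member, score):
        self.scores[member] = score

    def zincrby(self, member, delta):
        self.scores[member] = self.scores.get(member, 0) + delta

    def top(self, n):
        # ZREVRANGE 0 n-1 WITHSCORES: highest scores first
        return sorted(self.scores.items(), key=lambda kv: -kv[1])[:n]

lb = Leaderboard()
lb.zadd("alice", 120)
lb.zadd("bob", 300)
lb.zincrby("carol", 180)
print(lb.top(2))  # [('bob', 300), ('carol', 180)]
```

Keeping the set in Redis rather than app memory makes the ordering server-side and atomic across all application servers.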
Do you need full-text search or complex filters?
Elasticsearch. Don't use LIKE queries at scale. ES handles tokenization, relevance scoring, facets, geo queries. Sync from primary DB via Kafka → ES worker.
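Why tokenization beats LIKE: an inverted index answers term queries with one hash lookup per token instead of a full table scan. A toy sketch of the idea (Elasticsearch adds stemming, relevance scoring, and facets on top):

```python
# Toy inverted index: maps each token to the set of document IDs that
# contain it. Lookup cost is one dict hit per query token, independent
# of corpus size, which is why it scales where LIKE '%term%' cannot.
from collections import defaultdict

docs = {
    1: "cozy cabin with lake view",
    2: "modern loft near the lake",
    3: "downtown studio apartment",
}

index = defaultdict(set)
for doc_id, text in docs.items():
    for token in text.lower().split():
        index[token].add(doc_id)

def search(query):
    # AND semantics: intersect the posting sets of all query tokens.
    token_sets = [index[t] for t in query.lower().split()]
    return set.intersection(*token_sets) if token_sets else set()

print(search("lake"))         # {1, 2}
print(search("lake modern"))  # {2}
```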
Is this analytical/reporting (read-heavy, aggregate)?
ClickHouse or Redshift. Column-oriented storage is 10-100× faster for aggregations than row-oriented. Separate OLTP (write path) from OLAP (read/analytics path).
Is schema completely unknown or highly variable?
MongoDB or DynamoDB. Document stores allow schema evolution without migrations. But: design your access patterns first — "flexible schema" often becomes "schema spaghetti" at scale.
The Golden Rule
"Start with PostgreSQL. Only move to NoSQL when you have a proven, specific bottleneck that PostgreSQL cannot solve." — Most startups that switched to NoSQL too early regret it.
The Polyglot Rule
Real systems use multiple databases. Stripe uses PostgreSQL (ledger) + Redis (idempotency cache) + Elasticsearch (search) + Kafka (events). Use the right tool per access pattern, not one DB for everything.

Caching Strategy Selection

The right cache in the right place can reduce DB load by 99% and cut response latency by 10×. The wrong cache strategy causes stale data, stampedes, and cache-induced bugs.

| Technology | Type | Best For | Eviction | Persistence | Latency |
|---|---|---|---|---|---|
| Redis | In-memory, remote | Shared cache across servers, sessions, pub/sub, leaderboards | LRU, LFU, TTL | RDB snapshots + AOF log | <1ms |
| Memcached | In-memory, remote | Simple string/binary cache, horizontal scaling | LRU only | None (ephemeral) | <1ms |
| CDN (Cloudflare, CloudFront) | Edge cache, HTTP | Static assets, images, public API responses | TTL + Cache-Control | Edge PoP distributed | ~5-50ms |
| In-process (L1) | Local memory (JVM heap, Python dict) | Hottest data, config, feature flags | LRU (Caffeine, Guava) | None (process-local) | <0.1ms |
| Varnish / Nginx proxy cache | HTTP reverse proxy | Full-page caching, API response caching | TTL + PURGE | Disk or memory | <5ms |
🔑 Redis vs Memcached — which to choose?
Choose Redis in almost all cases. Redis has sorted sets (leaderboards), lists (timelines), pub/sub, Lua scripting, and persistence. Memcached is simpler/faster for pure string cache on multi-threaded loads — but Redis has mostly closed this gap.
🌐 When does CDN caching apply?
Any response that is: public (not user-specific) AND cacheable (same response for same URL) AND read-heavy. Images, JS/CSS, API responses without auth. Set Cache-Control: public, max-age=3600. Never cache auth tokens or personal data.
When to use in-process (L1) cache?
For data accessed thousands of times per second by one server instance. Feature flags, config, static lookup tables (country codes, currency rates). Synced from Redis on TTL or via pub/sub invalidation. Beware: stale data risk if you have multiple servers.
🛡️ Cache-Aside vs Write-Through vs Write-Behind?
Cache-aside: app manages cache manually (most common). Write-through: write to cache AND DB atomically (no stale reads). Write-behind: write to cache, async flush to DB (low latency, risk of data loss). Choose based on consistency requirements.
Cache Strategy Decision
Read >100× write ratio? → Use cache-aside. Need zero stale reads on write? → Write-through. Need ultra-low write latency + can tolerate data loss? → Write-behind. Most systems: cache-aside + TTL-based expiry + event-driven invalidation.
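Cache-aside in miniature, with dicts standing in for Redis and the primary DB. Note that the write path invalidates the cache rather than updating it, which avoids a class of stale-write races:

```python
# Cache-aside sketch: the app owns the cache. Read-through on miss,
# explicit invalidation on write. The dicts stand in for Redis and
# PostgreSQL so the pattern is visible without a live service.
import time

db = {"user:1": {"name": "Ada"}}            # stand-in for the primary DB
cache = {}                                   # stand-in for Redis: key -> (value, expires_at)
TTL = 60

def get_user(key):
    entry = cache.get(key)
    if entry and entry[1] > time.time():     # cache hit, not expired
        return entry[0]
    value = db.get(key)                      # cache miss: read from DB
    cache[key] = (value, time.time() + TTL)  # populate with TTL
    return value

def update_user(key, value):
    db[key] = value                          # write to DB first
    cache.pop(key, None)                     # invalidate, don't update

assert get_user("user:1") == {"name": "Ada"}    # miss -> DB -> cached
update_user("user:1", {"name": "Grace"})
assert get_user("user:1") == {"name": "Grace"}  # invalidated, re-read from DB
```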
Warning: Cache Stampede
When a popular key expires, all threads miss the cache simultaneously and hammer the DB. Fix: probabilistic early expiration, Redis NX mutex lock, or stale-while-revalidate pattern. Always design for what happens on cache miss.
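One stampede fix sketched in Python: a mutex so that only one caller recomputes the expired key. `threading.Lock` plays the role here that a Redis `SET key NX PX` lock plays across multiple servers:

```python
# Stampede protection: the thread that wins the mutex recomputes the
# key; the others briefly wait for it instead of hammering the DB.
import threading
import time

cache = {}
rebuild_lock = threading.Lock()
db_calls = 0

def expensive_db_read():
    global db_calls
    db_calls += 1
    time.sleep(0.05)                           # simulate a slow query
    return "fresh value"

def get(key):
    if key in cache:
        return cache[key]
    if rebuild_lock.acquire(blocking=False):   # winner recomputes
        try:
            if key not in cache:               # double-check after winning
                cache[key] = expensive_db_read()
        finally:
            rebuild_lock.release()
        return cache[key]
    while key not in cache:                    # losers wait for the winner
        time.sleep(0.005)
    return cache[key]

threads = [threading.Thread(target=get, args=("hot",)) for _ in range(20)]
for t in threads: t.start()
for t in threads: t.join()
print(db_calls)  # 1
```

Twenty concurrent misses result in a single DB read; the other nineteen callers wait for the winner rather than piling onto the database.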

Message Queue Selection

Every distributed system needs async communication. The choice of message queue determines your throughput ceiling, delivery guarantees, and operational complexity.

| Technology | Model | Throughput | Delivery | Retention | Best For |
|---|---|---|---|---|---|
| Apache Kafka | Distributed log (pub/sub) | 1M+ msg/sec per cluster | At-least-once (exactly-once with transactions) | Days to forever (log compaction) | Event streaming, audit logs, CDC, fan-out at scale |
| RabbitMQ | Message broker (AMQP) | 50-100K msg/sec | At-least-once (at-most-once with auto-ack) | Until consumed (queue-based) | Task queues, RPC, complex routing (fanout, topic, direct) |
| AWS SQS | Managed queue | Nearly unlimited (standard); ~3K msg/sec batched (FIFO) | At-least-once (standard); exactly-once processing (FIFO) | Up to 14 days | Decoupling AWS services, simple task queues, Lambda triggers |
| Redis Pub/Sub | In-memory pub/sub | ~100K msg/sec | At-most-once (fire and forget, no persistence) | None (ephemeral) | Real-time notifications, presence updates, invalidation signals |
| Redis Streams | Append-only log (in-memory) | ~100K msg/sec | At-least-once (consumer groups) | Configurable (MAXLEN) | Small-scale event streaming, activity feeds, audit trail |
| Google Pub/Sub | Managed pub/sub | Millions msg/sec | At-least-once | 7 days default (configurable) | GCP-native event streaming, global fan-out |
📊 Kafka vs RabbitMQ — the defining question
Kafka: you need replay, multiple consumers reading the same message, event sourcing, high throughput (>100K/s), or long retention. RabbitMQ: you need complex routing, task-per-consumer model, or moderate throughput. Kafka is a log; RabbitMQ is a broker.
🔁 When does "at-least-once" matter?
Always design for at-least-once. Even "exactly-once" Kafka requires idempotent consumers. Your consumer MUST be idempotent: processing the same message twice should produce the same result. Use a dedup key (message ID) to achieve effectively-once semantics.
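A dedup-key consumer in miniature. The `processed_ids` set stands in for a Redis set with TTL or a DB unique index; the point is that the side effect is guarded by the message ID:

```python
# Idempotent consumer: redelivery is a no-op because the side effect
# is keyed on the message ID. This turns at-least-once delivery into
# effectively-once processing.
processed_ids = set()   # in production: Redis set with TTL, or a DB unique index
balance = {"acct:1": 0}

def handle(message):
    if message["id"] in processed_ids:
        return                                # duplicate delivery: skip
    balance[message["acct"]] += message["amount"]
    processed_ids.add(message["id"])

msg = {"id": "evt-42", "acct": "acct:1", "amount": 100}
handle(msg)
handle(msg)                                   # redelivered by the broker
assert balance["acct:1"] == 100               # applied exactly once
```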
Redis Pub/Sub vs Kafka for notifications?
Redis Pub/Sub: if consumer is offline, message is lost. Use for ephemeral notifications (presence, invalidation) where missing one is acceptable. Kafka: durable, consumers can replay from offset. Use for business-critical events (payments, orders, user actions).
☁️ When to choose SQS over Kafka?
SQS when: on AWS, you want zero ops burden, each message processed by exactly one worker (task queue model), or you're triggering Lambda. Kafka when: multiple consumer groups read the same messages, you need replay, or cross-service streaming. SQS has no replay — deleted on consume.
Interview Default
In system design interviews, Kafka is usually the right answer for async communication at scale. It handles fan-out (multiple consumers), replay, high throughput, and is used by Uber, Netflix, LinkedIn, Twitter. Say "Kafka for the event bus" and you're on solid ground.

API Protocol Selection

The protocol you choose determines latency, streaming capability, client compatibility, and development friction. No single protocol wins in every context.

| Protocol | Transport | Bi-Directional? | Best For | Weaknesses |
|---|---|---|---|---|
| REST (HTTP/1.1) | HTTP, JSON | No (request/response) | Public APIs, browser clients, simple CRUD, ubiquitous compatibility | Over-fetching/under-fetching; multiple round trips; no streaming |
| GraphQL | HTTP, JSON | Subscriptions via WebSocket | Mobile APIs (bandwidth-sensitive), multi-entity queries, rapidly evolving schemas | Complex caching; N+1 query problem; over-engineering for simple APIs |
| gRPC | HTTP/2, Protocol Buffers | Yes (bidirectional streaming) | Internal service-to-service calls, high throughput, polyglot services | Limited browser support; binary protocol (harder to debug); requires codegen |
| WebSocket | TCP (upgraded HTTP) | Yes (full duplex) | Real-time messaging (chat, gaming, trading), live feeds, presence | Stateful connections (load-balancer routing); hard to scale; no request/response semantics |
| Server-Sent Events (SSE) | HTTP/1.1 | Server → client only | Live feeds, notifications, progress updates | One direction only; browser connection limits; no binary support |
| Long Polling | HTTP | Simulated | Fallback when WebSocket is unavailable; simple notification systems | High server resource usage; latency up to the poll interval; inefficient |
🌍 Public external API?
REST. Universally understood, easy to document (OpenAPI), simple to call from any language. Unless your clients have severe bandwidth constraints → GraphQL. REST is the industry default for good reason.
Internal service-to-service calls?
gRPC. Typically 5-10× faster than REST+JSON thanks to HTTP/2 multiplexing + Protobuf binary encoding. Strong typing via the Protobuf schema prevents API drift. Used internally by Google, Uber, Netflix, and Square.
💬 Real-time bidirectional communication?
WebSocket for chat, gaming, collaborative editing. But: connection is stateful — need sticky sessions or connection registry. SSE if server pushes only (dashboards, live feeds). Much simpler to implement and scales better.
📱 Mobile app with bandwidth sensitivity?
GraphQL or gRPC. GraphQL: fetch exactly the fields you need (reduce payload). gRPC: Protobuf encoding is 5-10× smaller than JSON. Both reduce data usage. Facebook, Twitter, Shopify use GraphQL for mobile APIs.
Decision Hierarchy
Public API → REST. Internal services → gRPC. Real-time push → WebSocket (bidirectional) or SSE (server-to-client). Mobile-optimized → GraphQL or gRPC. Never use WebSocket where SSE is sufficient — simpler is better.

Storage Type Selection

Databases store structured records. But images, videos, logs, and backups need different storage. Choosing the wrong storage type means either wasting money or hitting hard limits.

| Type | Examples | Access Pattern | Cost | Best For |
|---|---|---|---|---|
| Object storage | S3, GCS, Azure Blob | HTTP GET/PUT by key | $0.02-0.03/GB/month | Images, videos, backups, logs, ML datasets, static assets |
| Block storage | AWS EBS, GCP PD | Random read/write (disk-like) | $0.10-0.15/GB/month | Database disks, OS volumes, low-latency random I/O |
| File storage (NFS) | AWS EFS, GCP Filestore | POSIX file system, shared mount | ~$0.30/GB/month | Shared file systems (CMS, shared config), legacy apps needing POSIX |
| CDN edge cache | Cloudflare, CloudFront, Fastly | HTTP GET (cached at edge PoP) | $0.005-0.085/GB egress | Static assets, video segments (HLS), public API responses |
| In-memory storage | Redis, Memcached | Key-based, <1ms | $0.40-0.80/GB/month | Hot cache, sessions, ephemeral data (not primary storage) |
📸 Storing user-uploaded images or videos?
Object Storage (S3) + CDN. Never store binary files in a database. S3: cheap, durable (11 nines), infinitely scalable. Pair with CloudFront/CDN for global delivery. Store only the S3 key in your DB row.
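The store-only-the-key pattern, sketched below. The bucket key layout and CDN domain are illustrative assumptions, not real endpoints; in production the upload itself would go through an S3 client or a presigned URL.

```python
# The DB row holds only the object key; the binary lives in object
# storage and is served through the CDN. URLs are built at read time.
import uuid

CDN_BASE = "https://cdn.example.com"   # hypothetical CloudFront/Cloudflare domain

def create_photo_row(user_id, filename):
    # Key layout is a common convention, not a requirement.
    key = f"uploads/{user_id}/{uuid.uuid4()}/{filename}"
    return {"user_id": user_id, "s3_key": key}   # this dict is the DB row

def public_url(row):
    return f"{CDN_BASE}/{row['s3_key']}"

row = create_photo_row(7, "avatar.png")
print(public_url(row))   # e.g. https://cdn.example.com/uploads/7/<uuid>/avatar.png
```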
🗄️ What storage does my database use?
Block Storage (EBS/PD). Databases (Postgres, MySQL, Cassandra) need low-latency random I/O — only block storage provides this. Use provisioned IOPS (gp3/io2) for high-throughput databases. Not object storage (too high latency for random reads).
📋 Sharing files across multiple servers?
Object Storage or NFS (EFS). EFS = POSIX shared mount (expensive, legacy). S3 = modern approach — servers fetch files by key over HTTP. S3 is stateless and infinitely scalable. EFS is convenient but 10× the cost of S3.
The Rule of Three Storage Tiers
Hot data (active users, recent transactions) → Redis/SSD. Warm data (past 30 days) → PostgreSQL/EBS. Cold data (archives, logs, media) → S3 with Glacier or Intelligent-Tiering. Each tier is ~10× cheaper than the previous.

Architecture Pattern Selection

The most expensive decisions to reverse. Monolith vs. microservices, sync vs. async, stateful vs. stateless — these shape your team structure and deployment model for years.

| Pattern | Complexity | Scale | Team Size | Use When | Avoid When |
|---|---|---|---|---|---|
| Monolith | Low | Up to ~100K DAU | 1-20 engineers | Early stage, small team, rapid iteration, unclear domain | Services need different scaling or independent deployment |
| Modular monolith | Medium | Up to ~1M DAU | 20-100 engineers | Clear domain boundaries; want monolith simplicity with structure | Services truly need independent scaling or different tech stacks |
| Microservices | High | Unlimited (per service) | 100+ engineers | Different scaling needs per service, large team, clear domain boundaries | Small team (<20), unclear domain, no DevOps/SRE capacity |
| Event-driven (CQRS/ES) | Very high | Very high read throughput | 50+ engineers | Read/write ratio >100:1, audit log required, temporal queries | Simple CRUD; teams unfamiliar with eventual consistency |
| Serverless (FaaS) | Medium | Spiky/unpredictable | Any | Bursty workloads, event-triggered tasks, minimal ops overhead | Low-latency paths (cold starts), stateful services, sustained high throughput (expensive) |
🏗️ Monolith first or microservices first?
Monolith first. Always. Even Amazon, Netflix, Uber, Shopify started as monoliths. "Majestic monolith" beats premature microservices. Extract services only when: a single component has a different scaling requirement, teams own clear domain boundaries, or the monolith is a deployment bottleneck.
🔄 Synchronous vs Asynchronous communication?
Sync (REST/gRPC): when the client needs the response immediately (user-facing reads, auth). Async (Kafka/SQS): when the work can happen later (email, transcoding, analytics, fan-out). Rule: if user is waiting, go sync. If work can happen "eventually," go async.
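The sync/async split, sketched with `queue.Queue` standing in for SQS or Kafka: the handler returns the confirmation immediately, and a background worker sends the email "eventually".

```python
# Async decoupling in miniature: the request handler does the sync
# work (order confirmation) and enqueues the slow side effect (email)
# for a background worker instead of blocking on it.
import queue
import threading

email_queue = queue.Queue()
sent = []

def email_worker():
    while True:
        job = email_queue.get()
        if job is None:                              # shutdown signal
            break
        sent.append(f"receipt for order {job}")      # slow side effect lives here
        email_queue.task_done()

worker = threading.Thread(target=email_worker)
worker.start()

def place_order(order_id):
    confirmation = {"order_id": order_id, "status": "confirmed"}  # user waits for this
    email_queue.put(order_id)                        # user does NOT wait for this
    return confirmation

resp = place_order(101)
email_queue.join()        # drain the queue (for the demo only)
email_queue.put(None)
worker.join()
print(resp, sent)
```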
📦 Containers vs Serverless vs VMs?
Containers (Kubernetes): persistent services, predictable load, portability. Serverless: spiky/event-driven, zero ops, short-lived tasks. VMs: legacy apps, OS-level control, database servers. Most modern apps: containers for services + serverless for background jobs.
⚖️ CQRS vs Regular DB — when does it pay off?
CQRS pays off when: read/write ratio >100:1, you need different read models than write models, or you need event sourcing for audit/compliance. Cost: eventual consistency, multiple data stores, complex code. Most systems don't need CQRS. Start with a single DB with read replicas.
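CQRS with event sourcing in miniature: commands append immutable events, and a projector folds them into a denormalized read model that can lag the write side.

```python
# Write side: an append-only event log (full audit trail).
# Read side: a projection rebuilt from events; it is eventually
# consistent with the writes.
events = []         # the write model
read_model = {}     # the read model: current balances per account

def command_deposit(account, amount):
    events.append({"type": "deposited", "account": account, "amount": amount})

def project():
    # In production this is an async consumer of the event stream.
    read_model.clear()
    for e in events:
        read_model[e["account"]] = read_model.get(e["account"], 0) + e["amount"]

command_deposit("a1", 50)
command_deposit("a1", 25)
stale = read_model.get("a1", 0)
assert stale == 0                        # read side lags until the projector runs
project()
assert read_model["a1"] == 75            # projection caught up
assert len(events) == 2                  # audit trail retained forever
```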
Conway's Law
"Organizations design systems that mirror their communication structure." Small team with 3 engineers → monolith. 300-person company with 30 squads → microservices. Architecture follows team topology, not the other way around.

Numbers Every Engineer Should Know

These numbers are your foundation for back-of-envelope estimation. Memorize them. Interviewers expect you to justify architecture decisions with quantitative reasoning.

| Item | Number | Notes |
|---|---|---|
| Redis | ~100K ops/sec | Single-threaded, in-memory. GET/SET at 0.1ms p99. Pipelined: up to 1M/sec. |
| Kafka | 1M+ msg/sec | Per cluster. Scale horizontally by adding partitions. p99 produce latency: ~5ms. |
| PostgreSQL | ~10K writes/sec | Single primary. With read replicas: 100K+ reads/sec. Index-based reads: <1ms. |
| Cassandra | 100K+ writes/sec | Per node, linearly scalable. Write path: memtable → SSTable (sequential writes are fast). |
| SSD read | ~0.1ms (100μs) | NVMe SSD sequential read. Random read: ~0.5ms. ~100× faster than an HDD seek. |
| HDD read | ~10ms | Rotational disk seek time. Sequential reads ~200MB/s. Avoid for random I/O. |
| L1 cache (CPU) | ~0.5ns | On-die cache. L2: ~7ns. L3: ~40ns. RAM: ~100ns. All far faster than disk. |
| RAM access | ~100ns | 100 nanoseconds = 0.0001ms. This is why Redis (in-memory) is ~1000× faster than SSD. |
| LAN round trip | ~0.5ms | Same data center, same region. Service-to-service calls: 0.5-2ms typical. |
| Cross-region | ~50-150ms | US-East to EU-West: ~80ms. US to APAC: ~150ms. Speed of light across oceans. |
| CDN edge | ~5-50ms | Cloudflare/Fastly: ~300 PoPs. User to nearest PoP: typically <20ms anywhere. |
| Elasticsearch | ~5K writes/sec | Per node with indexing. Search reads: 50-100ms p50. Scale reads with replicas. |
| HTTP request | ~100-500 bytes | Headers + body. JSON overhead: ~3-5× vs binary. gRPC Protobuf: 5-10× smaller than REST JSON. |
| Typical API | <100ms p99 | Target SLA for user-facing APIs. >300ms feels "slow"; >1s and users abandon. |
| AWS S3 | 5TB max object | ~3,500 PUT req/s and ~5,500 GET req/s per prefix. Use multipart upload for >100MB files. |
| TCP connection | ~8KB kernel memory | 1M TCP connections ≈ 8GB RAM. WebSocket connections: similar. A single Node.js process comfortably holds 10K+ concurrent connections (event loop). |
Estimation Cheat Sheet
1 day = 86,400 seconds ≈ 100K. 1M DAU × 10 actions/day = 10M daily events = 100 events/sec avg. Peak: 5× average = 500/sec. 1TB = 1,000 GB. 1 billion records × 100 bytes = 100GB. Bandwidth: 1Gbps = 125MB/s.
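The cheat sheet as runnable arithmetic, using the 1M-DAU example from above:

```python
# Back-of-envelope sizing: 1M DAU at 10 actions/day, sized for peak
# load and a year of 100-byte event storage.
DAU = 1_000_000
actions_per_user = 10
seconds_per_day = 86_400                 # ~100K for mental math

daily_events = DAU * actions_per_user    # 10M events/day
avg_rps = daily_events / seconds_per_day # ~116/s (the "100/s" of mental math)
peak_rps = avg_rps * 5                   # peak = 5x average

bytes_per_event = 100
storage_per_year_gb = daily_events * bytes_per_event * 365 / 1e9

print(round(avg_rps), round(peak_rps), round(storage_per_year_gb))  # 116 579 365
```

The 86,400 ≈ 100K shortcut slightly underestimates rates (100/s vs the true ~116/s), which is fine for interview-grade estimates.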

The 12 Golden Rules of Technology Selection

These rules encode hard-won industry knowledge. They are not absolute laws — but violating them requires an excellent justification.

01
Start with PostgreSQL
Default to SQL. Its ACID guarantees and query flexibility solve 80% of problems. Migrate to NoSQL only when you have a proven, specific, production bottleneck — not a hypothetical one.
02
Measure before you optimize
Never add caching, sharding, or replication speculatively. Profile the actual bottleneck. "DB is slow" is not a bottleneck — "this query at 5K QPS takes 800ms" is.
03
Caching is not a silver bullet
A cache solves read-heavy load. It doesn't fix write-heavy load, bad query design, or missing indexes. Always fix the query first. Cache the proven bottleneck second.
04
Monolith before microservices
Premature service decomposition creates distributed system complexity without scale benefits. Every microservice introduces network calls, distributed transactions, and independent deployments. Earn your microservices.
05
Use managed services over self-hosted
Running Kafka yourself = 6 months of ops work. AWS MSK = 2 hours. Use managed services (RDS, MSK, ElastiCache, SQS) unless you have specific requirements that justify the operational cost.
06
Design for failure, not uptime
Assume every component fails. Design retries, circuit breakers, graceful degradation. The question is never "will it fail?" but "what happens when it does?" Idempotency and at-least-once semantics everywhere.
07
Async everything that doesn't need to be sync
Sending email, processing payments, transcoding video, sending analytics — none of these need to block the HTTP response. Async decoupling reduces latency, increases resilience, and enables retry.
08
Data has gravity
Moving data is expensive and risky. Design your data model before your service boundaries. Services should not need to query across each other's databases — that's a sign of a wrong decomposition.
09
CAP: you must choose
During a network partition, you choose CP (consistency + partition tolerance: Cassandra with QUORUM) or AP (availability + partition tolerance: DynamoDB with eventual consistency). Know which your data requires.
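The quorum overlap argument, made concrete: with N=3, W=2, R=2, any read set intersects any write set in at least one replica that saw the latest version, so a quorum read returns the latest write even when one replica is stale.

```python
# Quorum reads/writes: R + W > N guarantees read-your-writes.
# Versions (simplified vector clocks) break ties between replicas.
N, W, R = 3, 2, 2
assert R + W > N                              # the overlap condition

replicas = [{"value": "old", "version": 1} for _ in range(N)]

def quorum_write(value):
    version = max(r["version"] for r in replicas) + 1
    for r in replicas[:W]:                    # only W replicas ack the write
        r.update(value=value, version=version)

def quorum_read():
    acked = replicas[N - R:]                  # worst case: overlaps the write set by 1
    return max(acked, key=lambda r: r["version"])["value"]

quorum_write("new")
assert replicas[2]["version"] == 1            # replica 3 never saw the write
assert quorum_read() == "new"                 # quorum overlap still returns it
```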
10
Idempotency is non-negotiable
Any distributed operation can be retried. Any API call can be duplicated by networks, proxies, or retry logic. Idempotent operations (same result on repeat) are the difference between safety and data corruption.
11
The write path is usually the hard part
Reads can be scaled with caches and replicas. Writes must go to a primary — that's where contention, locking, and bottlenecks live. Design your write path first. The read path usually follows naturally.
12
Don't be clever, be boring
Boring technology choices (Postgres, Redis, Kafka, S3, Nginx) are battle-tested, well-documented, and easy to hire for. Clever technology choices are resume-driven and create knowledge gaps at 3am during an incident.