Database Connection Pooling
PgBouncer, connection limits, pool modes, and the N+1 query problem.
4 Exercises
12 Concept Checks
~85 min total
System Design
Connection Pool Configuration
PostgreSQL max_connections=100. Your app has 50 Node.js servers, each with a connection pool of 20, for 1,000 attempted connections. PostgreSQL accepts the first ~100 and rejects the rest with "FATAL: remaining connection slots are reserved for non-replication superuser connections."
Problem vs Solution
❌ Without Pooler
50 servers × pool=20
→ 1,000 connections
PostgreSQL (max 100)
ERROR: too many clients
✅ With PgBouncer
50 servers → PgBouncer
→ pool_size=20
PostgreSQL (20 real connections)
Concept Check — 3 questions
Q1. The fix for exceeding PostgreSQL max_connections is?
A. Increase PostgreSQL max_connections to 1,000 to match app demand
B. Add PgBouncer as a connection pooler — multiplexes N app connections to M DB connections
C. Reduce the number of Node.js servers from 50 to 5
D. Switch from PostgreSQL to a database without connection limits
Q2. PgBouncer transaction mode vs session mode — what is the key difference?
A. No functional difference — both modes behave identically
B. Transaction mode is always slower than session mode
C. Transaction mode releases the DB connection after each transaction; session mode holds the connection for the entire session duration
D. Session mode does not support database transactions
Q3. With PgBouncer pool_size=20 and 50 app servers each wanting 20 connections, actual PostgreSQL connections are?
A. 20 — PgBouncer maintains a pool of 20 DB connections shared across all app servers
B. 1,000 — one per app connection request
C. 50 — one per app server
D. 100 — PostgreSQL's max_connections setting
PgBouncer pool_size=40 leaves headroom below the DB max of 50. In transaction mode, each transaction gets whichever pooled connection is free, so prepared statements don't persist between transactions — use session mode if you rely heavily on prepared statements. Monitor pool health via PgBouncer's SHOW POOLS admin command (sv_active, sv_idle, cl_waiting, maxwait) — alert when client wait time exceeds ~10ms, indicating pool saturation.
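The settings above can be sketched as a pgbouncer.ini. This is an illustrative fragment, not a complete production config: the host, dbname, and max_client_conn value are assumptions sized for roughly 1,000 client connections with slack.

```ini
; Sketch of a pgbouncer.ini for this scenario (host/dbname are hypothetical).
[databases]
appdb = host=10.0.0.5 port=5432 dbname=appdb

[pgbouncer]
listen_addr = 0.0.0.0
listen_port = 6432
pool_mode = transaction      ; release the DB connection after each transaction
default_pool_size = 40       ; real DB connections; headroom below the max of 50
max_client_conn = 2000       ; ~1,000 app-side connections, plus slack
reserve_pool_size = 5        ; extra connections released when the pool saturates
```

Point the app servers at port 6432 instead of 5432; PgBouncer multiplexes their connections onto at most 40 real ones. Connect to the admin database (`psql -p 6432 pgbouncer`) and run `SHOW POOLS;` to observe cl_waiting and sv_idle.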
Open Design Challenge
Design PgBouncer config: app has 100 servers, each with 10-connection pool. DB can handle max 50 connections. What pool_size and max_client_conn settings do you use?
Transaction mode breaks when using prepared statements. Why? What is the underlying mechanism that causes this failure?
How do you monitor connection pool saturation? What metrics do you alert on and at what thresholds?
N+1 Query Detection and Fix
A Django API lists 100 blog posts. Each post needs the author's name. Naive ORM code: fetch 100 posts (1 query), then for each post fetch the author (100 queries) = 101 queries total. At 10 requests/sec = 1,010 DB queries/sec. The DB maxes out at 500 QPS and starts dropping connections.
N+1 Query Problem Visualized
API request
→
1 query: SELECT * FROM posts LIMIT 100
→
100 queries: SELECT * FROM users WHERE id=X (loop)
→
101 total DB queries
Concept Check — 3 questions
Q1. The N+1 problem: 1 query fetches N records, then N additional queries fetch related data. The standard fix is?
A. Add more database connections to handle the extra query load
B. Eager loading with JOIN or SELECT IN — fetch posts and authors in one query
C. Cache each individual author query result in Redis
D. Reduce posts per page from 100 to 10 to reduce query count
Q2. "SELECT * FROM posts JOIN users ON posts.author_id = users.id LIMIT 100" reduces 101 queries to?
A. 1 query — the JOIN fetches posts and authors together in a single SQL statement
B. 2 queries — one for posts, one for all authors
C. 101 queries — the JOIN doesn't reduce query count
D. 100 queries — one per post
Q3. When should you use SELECT IN (batch fetching) instead of a JOIN?
A. Always — SELECT IN is always preferable to JOIN
B. When tables are small — JOINs are only efficient for large tables
C. When data is in different microservices or databases where SQL JOINs across services are impossible
D. When using any NoSQL database
Django fix: Post.objects.select_related('author') generates a JOIN automatically. Cross-service N+1: collect all author_ids from 100 posts → call user-service with GET /users?ids=1,2,3,...,N (one batch call) → map results back to posts. APM trace pattern: a "fan" of identical queries where only the WHERE id=X value changes — this is the N+1 signature in distributed traces.
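The N+1 pattern and the JOIN fix can be demonstrated without Django. This is a minimal sketch using Python's built-in sqlite3 with an in-memory database; table and column names are illustrative, and the Django equivalent of the JOIN is `Post.objects.select_related('author')` as noted above.

```python
import sqlite3

# Illustrative schema: 2 authors, 3 posts (an in-memory SQLite DB).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE posts (id INTEGER PRIMARY KEY, title TEXT,
                        author_id INTEGER REFERENCES users(id));
    INSERT INTO users VALUES (1, 'ada'), (2, 'grace');
    INSERT INTO posts VALUES (1, 'p1', 1), (2, 'p2', 2), (3, 'p3', 1);
""")

# N+1: one query for the posts, then one query per post for its author.
queries = 0
posts = conn.execute("SELECT id, title, author_id FROM posts").fetchall()
queries += 1
for _, _, author_id in posts:
    conn.execute("SELECT name FROM users WHERE id = ?", (author_id,)).fetchone()
    queries += 1
print(queries)  # 4 queries for 3 posts (1 + N); 101 for 100 posts

# Fix: a single JOIN fetches posts and their authors together.
rows = conn.execute("""
    SELECT p.id, p.title, u.name
    FROM posts p JOIN users u ON p.author_id = u.id
""").fetchall()
print(len(rows))  # 3 rows from 1 query
```

The loop is exactly the "fan" an APM trace shows: N identical statements differing only in the bound id.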
Open Design Challenge
Write the SQL and Django ORM code for fixing the blog N+1 problem using a JOIN. Show both the raw SQL and the ORM equivalent.
If posts and authors are in different microservices, how do you solve the N+1 without SQL JOINs? Describe the batch fetching flow.
What APM (Application Performance Monitoring) metric reveals N+1 queries in production? Describe the trace pattern you would see.
Read Replica Routing
A travel booking site has 80% read traffic (search, browse) and 20% write traffic (bookings). All traffic goes to the primary PostgreSQL instance, causing CPU at 90%. Adding 2 read replicas reduces primary CPU — but only if reads are actually routed to replicas.
Read/Write Split Architecture
Application
→ Writes → Primary DB
→ Reads → Replica 1
→ Reads → Replica 2
Primary → replicas: async replication (lag ~50-100ms)
Concept Check — 3 questions
Q1. Replication lag is 100ms. A user books a flight (write to primary), then immediately loads a confirmation page (read from replica). What might they see?
A. Always fresh data — replicas are synchronous with the primary
B. Stale data — the replica hasn't received the write yet, so the booking may not appear
C. A database error — replicas reject reads during replication
D. Cached data from before the booking was made
Q2. The solution to the "read-your-writes" consistency problem is?
A. Set a longer cache TTL to hide the replication lag
B. Always read from the primary for all queries — never use replicas
C. Route reads that immediately follow writes to the primary for a short window (sticky reads)
D. Increase replica count to reduce replication lag
Q3. Synchronous vs asynchronous replication: which is slower for write operations?
A. Synchronous — the primary waits for the replica to acknowledge before confirming the write to the client
B. Asynchronous — fire-and-forget to replica adds more overhead
C. Same speed — replication mode doesn't affect write latency
D. Depends on network — only matters across data centers
Write detection in middleware: intercept queries starting with INSERT, UPDATE, DELETE, BEGIN, or COMMIT — route these to primary. Pin reads to primary for 500ms-2s after the last write — this covers typical async replication lag. Replica lag detection: query pg_stat_replication.replay_lag on the primary; if lag exceeds threshold (e.g., 5 seconds), remove that replica from the read pool.
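The routing logic above can be sketched as a small function. This is a simplified sketch, not production middleware: the function name, the 1-second pin window, and the per-user dict are all illustrative choices, and real middleware would also handle comments, CTEs, and functions with side effects.

```python
import time

WRITE_PREFIXES = ("INSERT", "UPDATE", "DELETE", "BEGIN", "COMMIT")
PIN_WINDOW_S = 1.0  # sticky-read window: covers typical async replication lag

last_write_at = {}  # user_id -> monotonic time of that user's last write


def route(user_id, sql, now=None):
    """Return 'primary' or 'replica' for this query (illustrative sketch)."""
    now = time.monotonic() if now is None else now
    stmt = sql.lstrip().upper()
    if stmt.startswith(WRITE_PREFIXES):
        last_write_at[user_id] = now  # open this user's sticky-read window
        return "primary"
    # Read-your-writes: recent writers keep reading from the primary.
    if now - last_write_at.get(user_id, float("-inf")) < PIN_WINDOW_S:
        return "primary"
    return "replica"


print(route("u1", "UPDATE bookings SET paid = 1", now=100.0))  # primary
print(route("u1", "SELECT * FROM bookings", now=100.5))        # primary (pinned)
print(route("u1", "SELECT * FROM bookings", now=102.0))        # replica
```

In a real deployment the per-user pin state lives somewhere shared (a session cookie timestamp or Redis), since consecutive requests may hit different app servers.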
Open Design Challenge
Design a read/write splitting middleware layer: how does it intercept queries and route writes to primary vs reads to replicas?
After a write, how long should reads be "pinned" to the primary for read-your-writes consistency? How do you track this per-user?
A replica falls 30 seconds behind due to a slow disk. How do you detect this lag and route traffic around the stale replica?
Connection Pool Tuning Under Load
During Black Friday, a retail site processes 50,000 checkout requests/minute. Each checkout requires 5 DB queries × 10ms avg = 50ms of DB time. With pool_size=10, requests queue waiting for a free connection. Pool queue depth reaches 500, causing 5-second wait times.
Little's Law: Sizing the Pool
50K req/min = 833 req/sec
× 50ms DB time
= 42 concurrent connections needed
Pool of 10
bottleneck: 32 connections short
Concept Check — 3 questions
Q1. Little's Law: L = λ × W. At 833 requests/sec, each needing 50ms DB time, how many concurrent DB connections are needed?
A. 10 — current pool size is sufficient
B. 20 — double the current pool is enough
C. 42 — derived from L = 833 req/sec × 0.05 sec = 41.65 ≈ 42
D. 833 — one connection per concurrent request
Q2. Connection pool saturation appears as which symptom in application latency metrics?
A. CPU spike on the application servers — pool saturation causes CPU overhead
B. Increasing p99 latency while p50 stays flat — slow tail latency from connection queuing
C. Out-of-memory errors on application servers
D. Network bandwidth saturation on the DB server
Q3. Maximum PostgreSQL connections before performance degrades (too much context switching) on a 16-core server is approximately?
A. 16 — one connection per CPU core for optimal performance
B. 100-200 — beyond this, connection management overhead reduces throughput
C. 10,000 — PostgreSQL scales linearly with connections
D. Unlimited — PostgreSQL has no connection overhead
Little's Law: 500 req/sec × 0.02 sec = 10 concurrent connections minimum. Add 50% headroom: pool_size=15. Circuit breaker for pool exhaustion: when all connections are busy and queue depth exceeds threshold, return 503 immediately rather than queuing indefinitely — fast failure is better than slow timeouts. Prepared statements: in transaction mode, each transaction may get a different connection so the prepared statement doesn't persist — use session mode for prepared statement-heavy workloads.
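The Little's Law sizing and the fail-fast rule above can be sketched in a few lines. Function names and the queue-depth threshold are illustrative, not a fixed recipe.

```python
import math


def pool_size(req_per_sec, db_time_s, headroom=0.5):
    """Little's Law: L = lambda * W, then add headroom and round up."""
    concurrent = req_per_sec * db_time_s
    return math.ceil(concurrent * (1 + headroom))


def should_shed(in_use, pool, queue_depth, max_queue=50):
    """Circuit breaker: fail fast (HTTP 503) instead of queuing forever.

    Trips when every connection is busy AND the wait queue is deeper than
    max_queue (an illustrative threshold -- tune to your latency budget).
    """
    return in_use >= pool and queue_depth > max_queue


print(pool_size(500, 0.020))   # 15: 10 concurrent + 50% headroom
print(pool_size(833, 0.050))   # 63: ~42 concurrent + 50% headroom
print(should_shed(15, 15, 120))  # True -> return 503 immediately
```

Fast failure keeps p99 bounded: a shed request costs microseconds, while an unbounded queue turns pool saturation into multi-second timeouts for every caller.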
Open Design Challenge
Use Little's Law to size the connection pool for a service with 500 req/sec and 20ms average DB time per request. Include 50% headroom.
Design a circuit breaker for connection pool exhaustion: what happens when all connections are busy? Define thresholds and fallback behavior.
How do prepared statement caches interact with PgBouncer in transaction mode? Why does this cause failures and what's the fix?