Database Connection Pooling

PgBouncer, connection limits, pool modes, and the N+1 query problem.

4 Exercises
12 Concept Checks
~90 min total
System Design
Exercise 1 🟡 Easy ⏱ 15 min
Connection Pool Configuration
PostgreSQL max_connections=100. Your app has 50 Node.js servers × 20 connections per pool = 1,000 attempted connections. PostgreSQL rejects 900 of them with "FATAL: remaining connection slots are reserved for non-replication superuser connections."
Problem vs Solution
❌ Without Pooler
50 servers × pool=20
→ 1,000 connections
PostgreSQL (max 100)
ERROR: too many clients
✅ With PgBouncer
50 servers → PgBouncer
→ pool_size=20
PostgreSQL (20 real connections)
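A minimal pgbouncer.ini sketch for the scenario above; host, port, and database names are illustrative placeholders, and the key names are standard PgBouncer settings:

```ini
; Sketch for 50 app servers x 20 client connections, DB max_connections=100.
[databases]
appdb = host=10.0.0.5 port=5432 dbname=appdb

[pgbouncer]
listen_addr = 0.0.0.0
listen_port = 6432
pool_mode = transaction        ; release the server connection after each transaction
default_pool_size = 20         ; real PostgreSQL connections, well under max 100
max_client_conn = 1000         ; accept all 50 x 20 incoming client connections
```

The key asymmetry: max_client_conn bounds how many app connections PgBouncer accepts, while default_pool_size bounds how many real PostgreSQL connections it opens.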
Concept Check — 3 questions
Q1. The fix for exceeding PostgreSQL max_connections is?
A. Increase PostgreSQL max_connections to 1,000 to match app demand
B. Add PgBouncer as a connection pooler — multiplexes N app connections to M DB connections
C. Reduce the number of Node.js servers from 50 to 5
D. Switch from PostgreSQL to a database without connection limits
Q2. PgBouncer transaction mode vs session mode — what is the key difference?
A. No functional difference — both modes behave identically
B. Transaction mode is always slower than session mode
C. Transaction mode releases the DB connection after each transaction; session mode holds the connection for the entire session duration
D. Session mode does not support database transactions
Q3. With PgBouncer pool_size=20 and 50 app servers each wanting 20 connections, actual PostgreSQL connections are?
A. 20 — PgBouncer maintains a pool of 20 DB connections shared across all app servers
B. 1,000 — one per app connection request
C. 50 — one per app server
D. 100 — PostgreSQL's max_connections setting
PgBouncer pool_size=40 (leave headroom below the DB max of 50). Transaction mode: each transaction gets a fresh connection from the pool, so prepared statements don't persist between transactions; use session mode if you rely heavily on prepared statements. Monitor PgBouncer's SHOW POOLS output: sv_idle (free server connections), sv_active (in use), and cl_waiting (clients queued); alert when maxwait exceeds 10ms, indicating pool saturation.
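A minimal sketch of the saturation alerting described above. The function `check_pool_saturation` and the pre-parsed dict are illustrative; a real deployment would pull these columns from one row of PgBouncer's SHOW POOLS admin command:

```python
# Alert logic over one (pre-parsed) SHOW POOLS row. cl_waiting, maxwait,
# and sv_idle are real SHOW POOLS columns; the dict shape here is assumed.

def check_pool_saturation(pool_row, maxwait_threshold_s=0.01):
    """Return a list of alert strings for one SHOW POOLS row."""
    alerts = []
    # Clients queued and waiting longer than the threshold -> saturated pool.
    if pool_row["cl_waiting"] > 0 and pool_row["maxwait"] > maxwait_threshold_s:
        alerts.append(
            f"pool saturated: {pool_row['cl_waiting']} clients waiting "
            f"{pool_row['maxwait'] * 1000:.0f}ms"
        )
    # No idle server connections left -> the next client must queue.
    if pool_row["sv_idle"] == 0:
        alerts.append("no idle server connections (sv_idle=0)")
    return alerts

row = {"cl_waiting": 12, "maxwait": 0.250, "sv_idle": 0, "sv_active": 20}
print(check_pool_saturation(row))
```

In production this check would run on a scrape interval against the PgBouncer admin console rather than a static dict.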
Open Design Challenge
1
Design PgBouncer config: app has 100 servers, each with 10-connection pool. DB can handle max 50 connections. What pool_size and max_client_conn settings do you use?
2
Transaction mode breaks when using prepared statements. Why? What is the underlying mechanism that causes this failure?
3
How do you monitor connection pool saturation? What metrics do you alert on and at what thresholds?
Exercise 2 🟡 Easy ⏱ 20 min
N+1 Query Detection and Fix
A Django API lists 100 blog posts. Each post needs the author's name. Naive ORM code: fetch 100 posts (1 query), then for each post fetch the author (100 queries) = 101 queries total. At 10 requests/sec = 1,010 DB queries/sec. The DB maxes out at 500 QPS and starts dropping connections.
N+1 Query Problem Visualized
API request
1 query: SELECT * FROM posts LIMIT 100
100 queries: SELECT * FROM users WHERE id=X (loop)
101 total DB queries
Concept Check — 3 questions
Q1. The N+1 problem: 1 query fetches N records, then N additional queries fetch related data. The standard fix is?
A. Add more database connections to handle the extra query load
B. Eager loading with JOIN or SELECT IN — fetch posts and authors in one query
C. Cache each individual author query result in Redis
D. Reduce posts per page from 100 to 10 to reduce query count
Q2. "SELECT * FROM posts JOIN users ON posts.author_id = users.id LIMIT 100" reduces 101 queries to?
A. 1 query — the JOIN fetches posts and authors together in a single SQL statement
B. 2 queries — one for posts, one for all authors
C. 101 queries — the JOIN doesn't reduce query count
D. 100 queries — one per post
Q3. When should you use SELECT IN (batch fetching) instead of a JOIN?
A. Always — SELECT IN is always preferable to JOIN
B. When tables are small — JOINs are only efficient for large tables
C. When data is in different microservices or databases where SQL JOINs across services are impossible
D. When using any NoSQL database
Django fix: Post.objects.select_related('author') generates a JOIN automatically. Cross-service N+1: collect all author_ids from 100 posts → call user-service with GET /users?ids=1,2,3,...,N (one batch call) → map results back to posts. APM trace pattern: a "fan" of identical queries where only the WHERE id=X value changes — this is the N+1 signature in distributed traces.
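The cross-service batch flow above can be sketched as follows. `fetch_users_batch` stands in for one call to a hypothetical user-service batch endpoint; all names here are illustrative:

```python
# Cross-service N+1 fix: collect ids, make ONE batch call, map results back.

def fetch_users_batch(ids):
    # Stand-in for a single remote call (e.g. GET /users?ids=...).
    fake_user_service = {1: "Ada", 2: "Grace", 3: "Edsger"}
    return {uid: {"id": uid, "name": fake_user_service[uid]} for uid in ids}

def attach_authors(posts):
    """One batch call instead of one call per post (the N+1 fix)."""
    author_ids = {p["author_id"] for p in posts}   # dedupe repeated authors
    users = fetch_users_batch(sorted(author_ids))  # single round trip
    for p in posts:
        p["author"] = users[p["author_id"]]["name"]
    return posts

posts = [{"id": 10, "author_id": 1}, {"id": 11, "author_id": 2},
         {"id": 12, "author_id": 1}]
print(attach_authors(posts))
```

Note the dedupe step: 100 posts by 5 distinct authors should request 5 user records, not 100.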
Open Design Challenge
1
Write the SQL and Django ORM code for fixing the blog N+1 problem using a JOIN. Show both the raw SQL and the ORM equivalent.
2
If posts and authors are in different microservices, how do you solve the N+1 without SQL JOINs? Describe the batch fetching flow.
3
What APM (Application Performance Monitoring) metric reveals N+1 queries in production? Describe the trace pattern you would see.
Exercise 3 🔴 Medium ⏱ 25 min
Read Replica Routing
A travel booking site has 80% read traffic (search, browse) and 20% write traffic (bookings). All traffic goes to the primary PostgreSQL instance, causing CPU at 90%. Adding 2 read replicas reduces primary CPU — but only if reads are actually routed to replicas.
Read/Write Split Architecture
Application
Writes → Primary DB
Reads → Replica 1
Reads → Replica 2
← async replication (lag ~50-100ms)
Primary
Concept Check — 3 questions
Q1. Replication lag is 100ms. A user books a flight (write to primary), then immediately loads a confirmation page (read from replica). What might they see?
A. Always fresh data — replicas are synchronous with the primary
B. Stale data — the replica hasn't received the write yet, so the booking may not appear
C. A database error — replicas reject reads during replication
D. Cached data from before the booking was made
Q2. The solution to the "read-your-writes" consistency problem is?
A. Set a longer cache TTL to hide the replication lag
B. Always read from the primary for all queries — never use replicas
C. Route reads that immediately follow writes to the primary for a short window (sticky reads)
D. Increase replica count to reduce replication lag
Q3. Synchronous vs asynchronous replication: which is slower for write operations?
A. Synchronous — the primary waits for the replica to acknowledge before confirming the write to the client
B. Asynchronous — fire-and-forget to replica adds more overhead
C. Same speed — replication mode doesn't affect write latency
D. Depends on network — only matters across data centers
Write detection in middleware: intercept queries starting with INSERT, UPDATE, DELETE, BEGIN, or COMMIT — route these to primary. Pin reads to primary for 500ms-2s after the last write — this covers typical async replication lag. Replica lag detection: query pg_stat_replication.replay_lag on the primary; if lag exceeds threshold (e.g., 5 seconds), remove that replica from the read pool.
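A minimal sketch of the routing middleware described above, with per-user sticky reads. The prefix list and the 1-second pin window come from the notes; everything else (function names, in-memory tracking dict) is illustrative:

```python
import time

# Read/write splitting with sticky reads: writes go to primary and pin that
# user's reads to primary for PIN_WINDOW_S, covering async replication lag.

WRITE_PREFIXES = ("INSERT", "UPDATE", "DELETE", "BEGIN", "COMMIT")
PIN_WINDOW_S = 1.0

last_write_at = {}  # user_id -> timestamp of that user's last write

def route(user_id, sql, now=None):
    """Return 'primary' or 'replica' for this user's query."""
    now = time.monotonic() if now is None else now
    if sql.lstrip().upper().startswith(WRITE_PREFIXES):
        last_write_at[user_id] = now
        return "primary"
    # Sticky read: a recent writer reads from primary to see their own write.
    if now - last_write_at.get(user_id, float("-inf")) < PIN_WINDOW_S:
        return "primary"
    return "replica"

print(route("u1", "INSERT INTO bookings VALUES (1)", now=100.0))  # primary
print(route("u1", "SELECT * FROM bookings", now=100.5))           # primary (pinned)
print(route("u1", "SELECT * FROM bookings", now=102.0))           # replica
```

In a multi-server deployment the `last_write_at` map would live in shared state (e.g. a session store) rather than process memory, since consecutive requests may hit different app servers.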
Open Design Challenge
1
Design a read/write splitting middleware layer: how does it intercept queries and route writes to primary vs reads to replicas?
2
After a write, how long should reads be "pinned" to the primary for read-your-writes consistency? How do you track this per-user?
3
A replica falls 30 seconds behind due to a slow disk. How do you detect this lag and route traffic around the stale replica?
Exercise 4 🔥 Hard ⏱ 30 min
Connection Pool Tuning Under Load
During Black Friday, a retail site processes 50,000 checkout requests/minute. Each checkout requires 5 DB queries × 10ms avg = 50ms DB time. With pool_size=10, requests queue waiting for a free connection; pool queue depth reaches 500, causing 5-second wait times.
Little's Law: Sizing the Pool
50K req/min = 833 req/sec
× 50ms DB time
= 42 concurrent connections needed
Pool of 10
bottleneck: 32 connections short
Concept Check — 3 questions
Q1. Little's Law: L = λ × W. At 833 requests/sec, each needing 50ms DB time, how many concurrent DB connections are needed?
A. 10 — current pool size is sufficient
B. 20 — double the current pool is enough
C. 42 — derived from L = 833 req/sec × 0.05 sec = 41.65 ≈ 42
D. 833 — one connection per concurrent request
Q2. Connection pool saturation appears as which symptom in application latency metrics?
A. CPU spike on the application servers — pool saturation causes CPU overhead
B. Increasing p99 latency while p50 stays flat — slow tail latency from connection queuing
C. Out-of-memory errors on application servers
D. Network bandwidth saturation on the DB server
Q3. Maximum PostgreSQL connections before performance degrades (too much context switching) on a 16-core server is approximately?
A. 16 — one connection per CPU core for optimal performance
B. 100-200 — beyond this, connection management overhead reduces throughput
C. 10,000 — PostgreSQL scales linearly with connections
D. Unlimited — PostgreSQL has no connection overhead
Little's Law: 500 req/sec × 0.02 sec = 10 concurrent connections minimum. Add 50% headroom: pool_size=15. Circuit breaker for pool exhaustion: when all connections are busy and queue depth exceeds threshold, return 503 immediately rather than queuing indefinitely — fast failure is better than slow timeouts. Prepared statements: in transaction mode, each transaction may get a different connection so the prepared statement doesn't persist — use session mode for prepared statement-heavy workloads.
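The Little's Law sizing above reduces to a two-line calculation; a minimal sketch, with the 50% headroom factor from the text:

```python
import math

# Pool sizing via Little's Law: L = lambda * W, then add headroom.

def pool_size(req_per_sec, db_time_s, headroom=1.5):
    concurrent = req_per_sec * db_time_s  # L = lambda * W
    return math.ceil(concurrent * headroom)

# Black Friday scenario: 833 req/sec x 50ms -> ~42 concurrent, 63 with headroom
print(pool_size(833, 0.050))  # 63
# Challenge scenario: 500 req/sec x 20ms -> 10 concurrent, 15 with headroom
print(pool_size(500, 0.020))  # 15
```

Headroom matters because 833 req/sec is an average: bursts above it would otherwise queue immediately at exactly L connections.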
Open Design Challenge
1
Use Little's Law to size the connection pool for a service with 500 req/sec and 20ms average DB time per request. Include 50% headroom.
2
Design a circuit breaker for connection pool exhaustion: what happens when all connections are busy? Define thresholds and fallback behavior.
3
How do prepared statement caches interact with PgBouncer in transaction mode? Why does this cause failures and what's the fix?