Database Connection Pooling

PgBouncer, connection limits, pool modes, and the N+1 query problem.

4 Exercises
12 Concept Checks
~90 min total
System Design
Exercise 1 🟡 Easy ⏱ 15 min
Connection Pool Configuration
PostgreSQL max_connections=100. Your app has 50 Node.js servers × 20 connections per pool = 1,000 attempted connections. PostgreSQL rejects 900 of them with "FATAL: remaining connection slots are reserved for non-replication superuser connections."
Problem vs Solution
❌ Without Pooler
50 servers × pool=20
→ 1,000 connections
PostgreSQL (max 100)
ERROR: too many clients
✅ With PgBouncer
50 servers → PgBouncer
→ pool_size=20
PostgreSQL (20 real connections)
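A minimal pgbouncer.ini sketch for the scenario above; host, port, and database names are illustrative placeholders, and the key names are standard PgBouncer settings:

```ini
; Sketch for 50 app servers x 20 client connections, DB max_connections=100.
[databases]
appdb = host=10.0.0.5 port=5432 dbname=appdb

[pgbouncer]
listen_addr = 0.0.0.0
listen_port = 6432
pool_mode = transaction        ; release the server connection after each transaction
default_pool_size = 20         ; real PostgreSQL connections, well under max 100
max_client_conn = 1000         ; accept all 50 x 20 incoming client connections
```

The key asymmetry: max_client_conn bounds how many app connections PgBouncer accepts, while default_pool_size bounds how many real PostgreSQL connections it opens.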
Concept Check — 3 questions
Q1. The fix for exceeding PostgreSQL max_connections is?
A. Increase PostgreSQL max_connections to 1,000 to match app demand
B. Add PgBouncer as a connection pooler — multiplexes N app connections to M DB connections
C. Reduce the number of Node.js servers from 50 to 5
D. Switch from PostgreSQL to a database without connection limits
Q2. PgBouncer transaction mode vs session mode — what is the key difference?
A. No functional difference — both modes behave identically
B. Transaction mode is always slower than session mode
C. Transaction mode releases the DB connection after each transaction; session mode holds the connection for the entire session duration
D. Session mode does not support database transactions
Q3. With PgBouncer pool_size=20 and 50 app servers each wanting 20 connections, actual PostgreSQL connections are?
A. 20 — PgBouncer maintains a pool of 20 DB connections shared across all app servers
B. 1,000 — one per app connection request
C. 50 — one per app server
D. 100 — PostgreSQL's max_connections setting
PgBouncer pool_size=40 (leave headroom below the DB max of 50). Transaction mode: each transaction gets a fresh connection from the pool, so prepared statements don't persist between transactions; use session mode if you rely heavily on prepared statements. Monitor PgBouncer's SHOW POOLS output: sv_idle (free server connections), sv_active (in use), and cl_waiting (clients queued); alert when maxwait exceeds 10ms, indicating pool saturation.
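A minimal sketch of the saturation alerting described above. The function `check_pool_saturation` and the pre-parsed dict are illustrative; a real deployment would pull these columns from one row of PgBouncer's SHOW POOLS admin command:

```python
# Alert logic over one (pre-parsed) SHOW POOLS row. cl_waiting, maxwait,
# and sv_idle are real SHOW POOLS columns; the dict shape here is assumed.

def check_pool_saturation(pool_row, maxwait_threshold_s=0.01):
    """Return a list of alert strings for one SHOW POOLS row."""
    alerts = []
    # Clients queued and waiting longer than the threshold -> saturated pool.
    if pool_row["cl_waiting"] > 0 and pool_row["maxwait"] > maxwait_threshold_s:
        alerts.append(
            f"pool saturated: {pool_row['cl_waiting']} clients waiting "
            f"{pool_row['maxwait'] * 1000:.0f}ms"
        )
    # No idle server connections left -> the next client must queue.
    if pool_row["sv_idle"] == 0:
        alerts.append("no idle server connections (sv_idle=0)")
    return alerts

row = {"cl_waiting": 12, "maxwait": 0.250, "sv_idle": 0, "sv_active": 20}
print(check_pool_saturation(row))
```

In production this check would run on a scrape interval against the PgBouncer admin console rather than a static dict.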
Open Design Challenge
1
Design PgBouncer config: app has 100 servers, each with 10-connection pool. DB can handle max 50 connections. What pool_size and max_client_conn settings do you use?
2
Transaction mode breaks when using prepared statements. Why? What is the underlying mechanism that causes this failure?
3
How do you monitor connection pool saturation? What metrics do you alert on and at what thresholds?
Exercise 2 🟡 Easy ⏱ 20 min
N+1 Query Detection and Fix
A Django API lists 100 blog posts. Each post needs the author's name. Naive ORM code: fetch 100 posts (1 query), then for each post fetch the author (100 queries) = 101 queries total. At 10 requests/sec = 1,010 DB queries/sec. The DB maxes out at 500 QPS and starts dropping connections.
N+1 Query Problem Visualized
API request
1 query: SELECT * FROM posts LIMIT 100
100 queries: SELECT * FROM users WHERE id=X (loop)
101 total DB queries
Concept Check — 3 questions
Q1. The N+1 problem: 1 query fetches N records, then N additional queries fetch related data. The standard fix is?
A. Add more database connections to handle the extra query load
B. Eager loading with JOIN or SELECT IN — fetch posts and authors in one query
C. Cache each individual author query result in Redis
D. Reduce posts per page from 100 to 10 to reduce query count
Q2. "SELECT * FROM posts JOIN users ON posts.author_id = users.id LIMIT 100" reduces 101 queries to?
A. 1 query — the JOIN fetches posts and authors together in a single SQL statement
B. 2 queries — one for posts, one for all authors
C. 101 queries — the JOIN doesn't reduce query count
D. 100 queries — one per post
Q3. When should you use SELECT IN (batch fetching) instead of a JOIN?
A. Always — SELECT IN is always preferable to JOIN
B. When tables are small — JOINs are only efficient for large tables
C. When data is in different microservices or databases where SQL JOINs across services are impossible
D. When using any NoSQL database
Django fix: Post.objects.select_related('author') generates a JOIN automatically. Cross-service N+1: collect all author_ids from 100 posts → call user-service with GET /users?ids=1,2,3,...,N (one batch call) → map results back to posts. APM trace pattern: a "fan" of identical queries where only the WHERE id=X value changes — this is the N+1 signature in distributed traces.
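The cross-service batch flow above can be sketched as follows. `fetch_users_batch` stands in for one call to a hypothetical user-service batch endpoint; all names here are illustrative:

```python
# Cross-service N+1 fix: collect ids, make ONE batch call, map results back.

def fetch_users_batch(ids):
    # Stand-in for a single remote call (e.g. GET /users?ids=...).
    fake_user_service = {1: "Ada", 2: "Grace", 3: "Edsger"}
    return {uid: {"id": uid, "name": fake_user_service[uid]} for uid in ids}

def attach_authors(posts):
    """One batch call instead of one call per post (the N+1 fix)."""
    author_ids = {p["author_id"] for p in posts}   # dedupe repeated authors
    users = fetch_users_batch(sorted(author_ids))  # single round trip
    for p in posts:
        p["author"] = users[p["author_id"]]["name"]
    return posts

posts = [{"id": 10, "author_id": 1}, {"id": 11, "author_id": 2},
         {"id": 12, "author_id": 1}]
print(attach_authors(posts))
```

Note the dedupe step: 100 posts by 5 distinct authors should request 5 user records, not 100.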
Open Design Challenge
1
Write the SQL and Django ORM code for fixing the blog N+1 problem using a JOIN. Show both the raw SQL and the ORM equivalent.
2
If posts and authors are in different microservices, how do you solve the N+1 without SQL JOINs? Describe the batch fetching flow.
3
What APM (Application Performance Monitoring) metric reveals N+1 queries in production? Describe the trace pattern you would see.
Exercise 3 🔴 Medium ⏱ 25 min
Read Replica Routing
A travel booking site has 80% read traffic (search, browse) and 20% write traffic (bookings). All traffic goes to the primary PostgreSQL instance, causing CPU at 90%. Adding 2 read replicas reduces primary CPU — but only if reads are actually routed to replicas.
Read/Write Split Architecture
Application
Writes → Primary DB
Reads → Replica 1
Reads → Replica 2
← async replication (lag ~50-100ms)
Primary
Concept Check — 3 questions
Q1. Replication lag is 100ms. A user books a flight (write to primary), then immediately loads a confirmation page (read from replica). What might they see?
A. Always fresh data — replicas are synchronous with the primary
B. Stale data — the replica hasn't received the write yet, so the booking may not appear
C. A database error — replicas reject reads during replication
D. Cached data from before the booking was made
Q2. The solution to the "read-your-writes" consistency problem is?
A. Set a longer cache TTL to hide the replication lag
B. Always read from the primary for all queries — never use replicas
C. Route reads that immediately follow writes to the primary for a short window (sticky reads)
D. Increase replica count to reduce replication lag
Q3. Synchronous vs asynchronous replication: which is slower for write operations?
A. Synchronous — the primary waits for the replica to acknowledge before confirming the write to the client
B. Asynchronous — fire-and-forget to replica adds more overhead
C. Same speed — replication mode doesn't affect write latency
D. Depends on network — only matters across data centers
Write detection in middleware: intercept queries starting with INSERT, UPDATE, DELETE, BEGIN, or COMMIT — route these to primary. Pin reads to primary for 500ms-2s after the last write — this covers typical async replication lag. Replica lag detection: query pg_stat_replication.replay_lag on the primary; if lag exceeds threshold (e.g., 5 seconds), remove that replica from the read pool.
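A minimal sketch of the routing middleware described above, with per-user sticky reads. The prefix list and the 1-second pin window come from the notes; everything else (function names, in-memory tracking dict) is illustrative:

```python
import time

# Read/write splitting with sticky reads: writes go to primary and pin that
# user's reads to primary for PIN_WINDOW_S, covering async replication lag.

WRITE_PREFIXES = ("INSERT", "UPDATE", "DELETE", "BEGIN", "COMMIT")
PIN_WINDOW_S = 1.0

last_write_at = {}  # user_id -> timestamp of that user's last write

def route(user_id, sql, now=None):
    """Return 'primary' or 'replica' for this user's query."""
    now = time.monotonic() if now is None else now
    if sql.lstrip().upper().startswith(WRITE_PREFIXES):
        last_write_at[user_id] = now
        return "primary"
    # Sticky read: a recent writer reads from primary to see their own write.
    if now - last_write_at.get(user_id, float("-inf")) < PIN_WINDOW_S:
        return "primary"
    return "replica"

print(route("u1", "INSERT INTO bookings VALUES (1)", now=100.0))  # primary
print(route("u1", "SELECT * FROM bookings", now=100.5))           # primary (pinned)
print(route("u1", "SELECT * FROM bookings", now=102.0))           # replica
```

In a multi-server deployment the `last_write_at` map would live in shared state (e.g. a session store) rather than process memory, since consecutive requests may hit different app servers.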
Open Design Challenge
1
Design a read/write splitting middleware layer: how does it intercept queries and route writes to primary vs reads to replicas?
2
After a write, how long should reads be "pinned" to the primary for read-your-writes consistency? How do you track this per-user?
3
A replica falls 30 seconds behind due to a slow disk. How do you detect this lag and route traffic around the stale replica?
Exercise 4 🔥 Hard ⏱ 30 min
Connection Pool Tuning Under Load
During Black Friday, a retail site processes 50,000 checkout requests/minute. Each checkout requires 5 DB queries × 10ms avg = 50ms DB time. With pool_size=10, requests queue waiting for a free connection; pool queue depth reaches 500, causing 5-second wait times.
Little's Law: Sizing the Pool
50K req/min = 833 req/sec
× 50ms DB time
= 42 concurrent connections needed
Pool of 10
bottleneck: 32 connections short
Concept Check — 3 questions
Q1. Little's Law: L = λ × W. At 833 requests/sec, each needing 50ms DB time, how many concurrent DB connections are needed?
A. 10 — current pool size is sufficient
B. 20 — double the current pool is enough
C. 42 — derived from L = 833 req/sec × 0.05 sec = 41.65 ≈ 42
D. 833 — one connection per concurrent request
Q2. Connection pool saturation appears as which symptom in application latency metrics?
A. CPU spike on the application servers — pool saturation causes CPU overhead
B. Increasing p99 latency while p50 stays flat — slow tail latency from connection queuing
C. Out-of-memory errors on application servers
D. Network bandwidth saturation on the DB server
Q3. Maximum PostgreSQL connections before performance degrades (too much context switching) on a 16-core server is approximately?
A. 16 — one connection per CPU core for optimal performance
B. 100-200 — beyond this, connection management overhead reduces throughput
C. 10,000 — PostgreSQL scales linearly with connections
D. Unlimited — PostgreSQL has no connection overhead
Little's Law: 500 req/sec × 0.02 sec = 10 concurrent connections minimum. Add 50% headroom: pool_size=15. Circuit breaker for pool exhaustion: when all connections are busy and queue depth exceeds threshold, return 503 immediately rather than queuing indefinitely — fast failure is better than slow timeouts. Prepared statements: in transaction mode, each transaction may get a different connection so the prepared statement doesn't persist — use session mode for prepared statement-heavy workloads.
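The Little's Law sizing above reduces to a two-line calculation; a minimal sketch, with the 50% headroom factor from the text:

```python
import math

# Pool sizing via Little's Law: L = lambda * W, then add headroom.

def pool_size(req_per_sec, db_time_s, headroom=1.5):
    concurrent = req_per_sec * db_time_s  # L = lambda * W
    return math.ceil(concurrent * headroom)

# Black Friday scenario: 833 req/sec x 50ms -> ~42 concurrent, 63 with headroom
print(pool_size(833, 0.050))  # 63
# Challenge scenario: 500 req/sec x 20ms -> 10 concurrent, 15 with headroom
print(pool_size(500, 0.020))  # 15
```

Headroom matters because 833 req/sec is an average: bursts above it would otherwise queue immediately at exactly L connections.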
Open Design Challenge
1
Use Little's Law to size the connection pool for a service with 500 req/sec and 20ms average DB time per request. Include 50% headroom.
2
Design a circuit breaker for connection pool exhaustion: what happens when all connections are busy? Define thresholds and fallback behavior.
3
How do prepared statement caches interact with PgBouncer in transaction mode? Why does this cause failures and what's the fix?