Week 1 · Day 5

Rate Limiting &
Traffic Shaping

Watch tokens refill in real time. See how fixed-window lets double-rate bursts slip through. Compare all four algorithms side by side. Implement distributed rate limiting with Redis — correctly.

4 Algorithms · ~3h Study Time · 3 Quizzes
Token Bucket · Leaky Bucket · Fixed Window · Sliding Window · Redis Lua · Distributed RL
01 — The Four Algorithms

Same limit, different behavior under burst

All four algorithms limit to "100 requests per minute" — but they behave differently when a user sends 100 requests at once, then waits. Token bucket allows it. Leaky bucket queues them. Fixed window has a boundary vulnerability. Sliding window is smooth but expensive. See them live.

🪣 Rate Limiting Algorithm Simulator
Select an algorithm. Click "Send Request" or "Send Burst" to fire requests. Watch the internal state update — tokens, queue depth, window counters.
💡 Token Bucket

Tokens refill at a fixed rate (e.g., 10/s). Requests consume one token. If tokens available: allow. If empty: reject (or queue). Allows controlled bursting — if the bucket is full and a burst arrives, all tokens can be consumed at once. Most APIs (Stripe, GitHub) use token bucket because it allows temporary bursts.
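The refill-and-consume logic above can be sketched in a few lines of Python — an in-process illustration with made-up names, not a production limiter:

```python
import time

class TokenBucket:
    """Illustrative token bucket: refill at a fixed rate, consume one token per request."""

    def __init__(self, rate: float, capacity: int):
        self.rate = rate              # tokens added per second
        self.capacity = capacity      # max tokens the bucket can hold
        self.tokens = float(capacity) # start full: an initial burst is allowed
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill based on elapsed time, capped at capacity
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

# A full bucket of capacity 5 absorbs a 5-request burst, then rejects:
bucket = TokenBucket(rate=1, capacity=5)
results = [bucket.allow() for _ in range(6)]
```

Note how the burst drains the bucket instantly — exactly the "controlled bursting" behavior described above, and why the table below warns that the burst still reaches downstream services.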

02 — Fixed vs Sliding Window

The boundary attack — why fixed window is broken

Fixed window resets its counter at a hard boundary (e.g., every minute on the clock). An attacker who knows the boundary can send 100 requests at T=59s and 100 more at T=61s — 200 requests in 2 seconds, yet each window's counter stays within its limit. Sliding window closes this hole by counting over a rolling interval instead of resetting at a boundary.
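The attack is easy to reproduce against a toy fixed-window counter (an in-process sketch with hypothetical names, not the lesson's demo code):

```python
from collections import defaultdict

LIMIT = 100
WINDOW = 60  # seconds

counters = defaultdict(int)  # (user, window index) -> request count

def fixed_window_allow(user: str, now: float) -> bool:
    # The counter is keyed by window index, so it resets at every hard boundary
    key = (user, int(now // WINDOW))
    if counters[key] < LIMIT:
        counters[key] += 1
        return True
    return False

# 100 requests at T=59s fall in window 0; 100 more at T=61s fall in window 1.
allowed = sum(fixed_window_allow("u1", 59.0) for _ in range(100))
allowed += sum(fixed_window_allow("u1", 61.0) for _ in range(100))
# All 200 are allowed within 2 seconds — double the intended rate.
```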

📊 Window Vulnerability Demo
The timeline shows requests over 2 minutes. The red zone is the boundary attack window. Click "Boundary Attack" to see 2× the limit slip through the fixed window.

Panels: FIXED WINDOW (vulnerable) · SLIDING WINDOW (safe)
03 — Algorithm Deep Dive

When to use each algorithm

The choice of algorithm affects user experience, memory usage, and attack surface. A leaky bucket is deterministic but frustrating — it queues bursts, so users wait. A token bucket is user-friendly — it absorbs bursts — but passes those bursts downstream. The right choice depends on your use case.

Algorithm      | Burst Behavior              | Memory                         | Smoothing | Best For                         | Caveat
Token Bucket   | Allows burst up to capacity | O(1)                           | No        | API rate limits, user quotas     | Burst can still overload downstream
Leaky Bucket   | Queued — smoothed out       | O(queue)                       | Yes       | Network QoS, outbound HTTP rate  | Adds latency; queue can fill and drop
Fixed Window   | Boundary attack: 2× burst   | O(1)                           | No        | Simple use cases only            | Vulnerable at window boundary
Sliding Window | Strictly enforced           | O(requests) exact, O(1) approx | Yes       | Strict API limits, security      | Exact: high memory; approx: slight inaccuracy
🪣 Leaky Bucket — the detail

Requests enter a FIFO queue (the "bucket"). A constant-rate processor drains the queue (water "leaks" out at a fixed rate). If the queue is full, new requests are dropped. This guarantees a smooth output rate regardless of input bursts. Used in: network packet shaping, Nginx's limit_req module.
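The FIFO-plus-constant-drain structure can be sketched like this (illustrative names; a real shaper would run the drain loop on a worker thread):

```python
import time
from collections import deque

class LeakyBucket:
    """Illustrative leaky bucket: bounded FIFO queue drained at a constant rate."""

    def __init__(self, drain_rate: float, capacity: int):
        self.drain_rate = drain_rate  # requests processed per second
        self.capacity = capacity      # max queue depth before dropping
        self.queue = deque()

    def submit(self, request) -> bool:
        """Enqueue a request; drop it if the bucket is full."""
        if len(self.queue) >= self.capacity:
            return False  # bucket overflow: request is dropped
        self.queue.append(request)
        return True

    def drain_forever(self, process):
        """Background worker: hand one request to `process` every 1/drain_rate seconds."""
        while True:
            if self.queue:
                process(self.queue.popleft())
            time.sleep(1 / self.drain_rate)
```

Whatever the input burst looks like, `drain_forever` emits at most `drain_rate` requests per second — the smoothing guarantee the table above attributes to this algorithm.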

🪟 Sliding Window Log

Store a timestamp for every request. On each request, count timestamps in the last 60s. If count < limit: allow and add timestamp. If ≥ limit: reject. Accurate but O(requests) memory — for 1M users at 1000 req/min that's 1B timestamps. In practice: use sliding window counter (approximation) instead.
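The count-evict-append cycle described above, as a minimal in-process sketch (hypothetical class, not a distributed implementation):

```python
import time
from collections import deque
from typing import Optional

class SlidingWindowLog:
    """Exact sliding window: one timestamp stored per allowed request."""

    def __init__(self, limit: int, window: float = 60.0):
        self.limit = limit
        self.window = window
        self.log = deque()  # timestamps of allowed requests, oldest first

    def allow(self, now: Optional[float] = None) -> bool:
        now = time.monotonic() if now is None else now
        # Evict timestamps that have fallen out of the rolling window
        while self.log and self.log[0] <= now - self.window:
            self.log.popleft()
        if len(self.log) < self.limit:
            self.log.append(now)
            return True
        return False
```

The deque grows with the allowed request rate — the O(requests) memory cost that motivates the counter approximation below.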

04 — Redis Rate Limiting Implementation

Why naive Redis INCR breaks — and how Lua fixes it

A naive implementation — read the counter, check it against the limit, then INCR counter and EXPIRE counter 60 — has a race condition: two simultaneous requests can both read the counter before either increments it, so both see a value below the limit and both pass. (INCR and EXPIRE are also two separate commands, so the TTL can be lost if the client dies between them.) Lua scripts are atomic in Redis — they execute as a single unit, eliminating both races.

-- Redis Lua: Sliding Window Log (atomic, race-free)
-- KEYS[1] = rate limit key (e.g., "rl:user:42")
-- ARGV[1] = current timestamp (ms), ARGV[2] = window (ms), ARGV[3] = limit

local key = KEYS[1]
local now = tonumber(ARGV[1])
local window = tonumber(ARGV[2])
local limit = tonumber(ARGV[3])
local window_start = now - window

-- Remove entries older than the window
redis.call('ZREMRANGEBYSCORE', key, '-inf', window_start)

-- Count requests in window
local count = redis.call('ZCARD', key)

if count < limit then
  -- Allow: add this request's timestamp with score=timestamp
  -- (random suffix keeps concurrent same-ms requests from colliding as members)
  redis.call('ZADD', key, now, now .. '-' .. math.random(1, 999999))
  -- Set TTL so the key expires automatically for idle users
  redis.call('EXPIRE', key, math.ceil(window / 1000))
  return {1, limit - count - 1}  -- {allowed, remaining}
else
  return {0, 0}  -- {rejected, remaining}
end
# Python: calling the Lua script from FastAPI
import time

import redis.asyncio as redis
from fastapi import Request
from fastapi.responses import JSONResponse

RATE_LIMIT_SCRIPT = """... (Lua script above) ..."""

class RateLimiter:
    def __init__(self, redis_client, limit: int, window_ms: int):
        self.redis = redis_client
        self.limit = limit
        self.window_ms = window_ms
        # register_script returns a callable that EVALSHAs the script atomically
        self.script = self.redis.register_script(RATE_LIMIT_SCRIPT)

    async def is_allowed(self, user_id: str) -> tuple[bool, int]:
        now_ms = int(time.time() * 1000)
        result = await self.script(
            keys=[f"rl:{user_id}"],
            args=[now_ms, self.window_ms, self.limit],
        )
        allowed, remaining = bool(result[0]), result[1]
        return allowed, remaining

# FastAPI middleware
async def rate_limit_middleware(request: Request, call_next):
    user_id = request.headers.get("X-User-ID", request.client.host)
    allowed, remaining = await limiter.is_allowed(user_id)
    if not allowed:
        return JSONResponse(
            {"error": "rate_limit_exceeded"},
            status_code=429,
            headers={"Retry-After": "60", "X-RateLimit-Remaining": "0"},
        )
    response = await call_next(request)
    response.headers["X-RateLimit-Remaining"] = str(remaining)
    return response
05 — Distributed Rate Limiting

Rate limiting across 10 app servers

If each app server keeps a local counter, a user can send 100 requests to 10 different servers and never hit any single server's limit. You need a shared counter — which is why Redis is the standard. But Redis adds 1-2ms latency per request. That's acceptable for most APIs but not for sub-millisecond paths.
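The bypass is easy to see in a toy simulation — per-server counters only, no shared store (hypothetical names, in-process only):

```python
LOCAL_LIMIT = 100          # intended global limit: 100 requests/minute
local_counters = [0] * 10  # one independent counter per app server

def local_allow(server: int) -> bool:
    """Per-server check: no single server ever sees the user's global rate."""
    if local_counters[server] < LOCAL_LIMIT:
        local_counters[server] += 1
        return True
    return False

# An attacker round-robins 1000 requests across the 10 servers:
allowed = sum(local_allow(i % 10) for i in range(10 * LOCAL_LIMIT))
# Every request is allowed — 10x the intended global limit.
```

Replacing `local_counters` with a single shared Redis counter (as in the Lua script above) restores the global limit, at the cost of a network round trip per check.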

🌐 Distributed Rate Limiting Simulator
Three app servers, one shared Redis. Click "Local Only" to see how distributed attacks bypass per-server limits. Click "Redis Shared" to see correct global enforcement.
Panels: request counts for App Servers 1–3 · Global Counter (Redis) — limit 10 · Status indicator (ALLOWED / rejected)
🏗️ Sliding Window Counter Approximation — O(1) memory, high accuracy

The exact sliding window needs one entry per request (high memory). The approximation: keep two fixed-window counters (current window + previous window) and estimate: count = prev_count × (1 - elapsed/window) + curr_count. This gives <0.1% error vs exact sliding window at O(1) memory. Cloudflare uses this for their 1M req/s rate limiter.
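The weighted estimate is one line of arithmetic (hypothetical helper name, shown only to make the formula concrete):

```python
def approx_count(prev_count: int, curr_count: int,
                 elapsed: float, window: float) -> float:
    """Estimate requests in the rolling window from two fixed-window counters.

    elapsed = time since the current window started; the previous window's
    count is weighted by the fraction of it still inside the rolling window.
    """
    return prev_count * (1 - elapsed / window) + curr_count

# 30s into a 60s window, with 80 requests last window and 40 so far:
# half of the previous window still overlaps -> 80 * 0.5 + 40 = 80.0
estimate = approx_count(prev_count=80, curr_count=40, elapsed=30, window=60)
```

Only two integers per user are stored, versus one timestamp per request for the exact log — which is what makes this practical at very high request rates.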

🏆 Rate limit headers — 429 (RFC 6585), Retry-After (RFC 9110), and the de facto X-RateLimit-* conventions

X-RateLimit-Limit: 100 — max requests per window
X-RateLimit-Remaining: 47 — requests left in current window
X-RateLimit-Reset: 1709500800 — Unix timestamp when window resets
Retry-After: 30 — seconds until retry is safe (on 429 response)
Always return these on 429. Clients that don't get Retry-After often hammer you immediately, making the problem worse.

Rate Limiting Algorithm Selection

✅ Token Bucket — choose when...
You want to allow short bursts (bucket fills up to N tokens). API rate limiting for external users. Stripe, GitHub, Twitter APIs use token bucket. Implemented in Redis with DECR plus a periodic refill job, or a single atomic Lua script. Most flexible and user-friendly algorithm.
📊 Sliding Window Log — choose when...
You need exact rate limiting (no edge-of-window exploits). Stores each request timestamp in Redis sorted set. ZRANGEBYSCORE to count requests in window. Accurate but uses more memory (O(requests) per user per window).
⚠ Fixed Window Counter — avoid for APIs
Simplest: INCR counter per minute. Problem: "boundary burst" — user can send 100 at 11:59 and 100 at 12:00 = 200 in 2 seconds. Use only when this is acceptable (internal tools). Redis INCR + EXPIRE.
🔃 Leaky Bucket — choose when...
You need a constant output rate (smoothing bursts). Requests enter a queue, processed at fixed rate. Good for: protecting downstream services that can't handle bursts. Network QoS, video encoding queues.
Interview Answer: "I'd implement token bucket in Redis using atomic DECR. Each user has a key with their remaining tokens; a background job refills at rate R/sec using INCRBY with a cap. Return 429 Too Many Requests with Retry-After header when tokens are exhausted."
Quiz — Check Your Understanding

Rate Limiting

Q1. Your API has a limit of 100 requests/minute per user. At 11:59:55 PM, a user sends 100 requests (all allowed). At 12:00:01 AM (6 seconds later), the same user sends 100 more requests. With a fixed window counter, what happens?
Q2. You're implementing rate limiting in Redis using INCR + EXPIRE. Two requests arrive simultaneously at T=0. Both read the counter (value=0), both decide they're below the limit, both INCR (counter becomes 2 briefly, but both see "below limit"). What's wrong and how do you fix it?
Q3. Which rate limiting algorithm guarantees a perfectly smooth output rate (no burst) regardless of input patterns?