Stream Processing & Event Sourcing

Event log vs state, stream processing patterns, event sourcing benefits, CQRS, and read model rebuilding.

4 Exercises
12 Concept Checks
~90 min total
System Design
Exercise 1 🟡 Easy ⏱ 15 min
Event Sourcing Fundamentals
A banking app stores current account balance ($1,247.53). Traditional approach: UPDATE accounts SET balance=$1,247.53. Event-sourced approach: store the history — Deposited($500), Withdrew($200), Deposited($947.53). Current balance = replay all events. The balance is derived, not stored directly — history is the source of truth.
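"The balance is derived" is just a fold over the event history. A minimal Python sketch of replay (the event classes here are hypothetical, for illustration only, not a real event-sourcing framework):

```python
from dataclasses import dataclass

# Hypothetical event types for this example; real systems would also
# carry metadata (event id, timestamp, account id, schema version).
@dataclass
class Deposited:
    amount: float

@dataclass
class Withdrew:
    amount: float

def replay(events):
    """Derive the current balance by folding over the event history."""
    balance = 0.0
    for event in events:
        if isinstance(event, Deposited):
            balance += event.amount
        elif isinstance(event, Withdrew):
            balance -= event.amount
    return balance

history = [Deposited(500), Withdrew(200), Deposited(947.53)]
print(round(replay(history), 2))  # 1247.53
```

The same fold, started from an empty history, reproduces any past state: replay a prefix of the log and you get the balance as of that point in time.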
Event Log as Source of Truth
AccountOpened{$0} → Deposited{$500} → Withdrew{$200} → Deposited{$947.53}
→ replay →
Current state: balance = $1,247.53
Any past state: replayable anytime
Concept Check — 3 questions
Q1. The main advantage of event sourcing over storing only current state?
A. Faster reads — event sourcing requires less computation to get current state
B. Complete audit trail — you can replay events to reconstruct any past state and understand exactly what happened and when
C. Less storage — events are smaller than storing the full current state
D. Simpler queries — you can run ad-hoc SQL queries against the event log
Q2. In event sourcing, how do you get the current balance without replaying all events from the beginning?
A. Store both events and current state in parallel, always keeping both in sync
B. Use a cache layer that expires every hour and forces full replay
C. Snapshots — periodically capture current state; at replay time, start from the last snapshot and apply only subsequent events
D. Build an index of events sorted by their balance impact
Q3. Event sourcing is particularly suited for?
A. Any CRUD application that needs simple create/read/update/delete operations
B. Financial, audit-required, or compliance domains where complete history and the ability to reconstruct any past state are required
C. Social media applications where speed of reads is the only priority
D. Read-heavy analytics where writes are infrequent
Event sourcing stores what happened, not what is. The event log is append-only and immutable — you can never delete or modify a past event. To get current state, replay all events through the aggregate. Performance optimization: snapshots capture a point-in-time state; at startup, load the latest snapshot and replay only subsequent events. Snapshot frequency is tunable: every 100 events, every hour, or on demand. This is critical for accounts with millions of transactions.
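The snapshot optimization can be sketched as follows, assuming a snapshot is just (balance, version) and events are simple tuples (the API names here are assumptions, not a real store):

```python
# Sketch of snapshot-based replay. A snapshot captures (balance, version),
# where version is the count of events already folded in; replay then
# applies only events after that version.
SNAPSHOT_EVERY = 100  # tunable: snapshot every N events, hourly, or on demand

class AccountProjection:
    def __init__(self):
        self.balance = 0.0
        self.version = 0  # number of events applied so far

    def apply(self, event):
        kind, amount = event  # e.g. ("Deposited", 500.0)
        if kind == "Deposited":
            self.balance += amount
        elif kind == "Withdrew":
            self.balance -= amount
        self.version += 1

def load(snapshot, events_after_snapshot):
    """Start from the latest snapshot, then apply only subsequent events."""
    proj = AccountProjection()
    if snapshot is not None:
        proj.balance, proj.version = snapshot
    for event in events_after_snapshot:
        proj.apply(event)
    return proj

# Snapshot taken at version 2 with balance 300.0; only later events replay.
proj = load((300.0, 2), [("Deposited", 947.53)])
print(round(proj.balance, 2), proj.version)  # 1247.53 3
```

For an account with millions of transactions, this turns replay from "fold millions of events" into "load one snapshot, fold at most SNAPSHOT_EVERY events."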
Open Design Challenge
1
Design the event schema for a bank account. What fields does each event need? How do you version events when the schema changes in 2 years?
2
Design the snapshot strategy for a high-transaction account (10,000 transactions/day). At what frequency do you snapshot? What does the snapshot contain?
3
A regulatory audit requires the account balance at every second of the last 7 years. How does event sourcing enable this? What's the performance cost?
Exercise 2 🟡 Easy ⏱ 20 min
CQRS Pattern
An order management system has a complex domain model (orders, line items, discounts, promotions). Reads need a denormalized view (order summary with product names, total, shipping status). Writes use rich domain objects with business logic. CQRS separates the write model (commands) from the read model (queries) — each optimized independently.
Command Query Responsibility Segregation
Command: CreateOrder → OrderAggregate (domain logic) → Events: OrderCreated, ItemAdded
→ project →
Query: GetOrderSummary ← read model (denormalized) ← order_summary table (fast)
Concept Check — 3 questions
Q1. CQRS: Commands vs Queries — what is the fundamental difference?
A. Commands are fast; queries are slow — CQRS speeds up queries by separating them
B. Commands mutate state (create/update/delete); queries read state without any side effects
C. Commands use SQL databases; queries use NoSQL databases
D. Commands are synchronous; queries are always asynchronous
Q2. The query side of CQRS is optimized by?
A. Adding more database indexes to the write model's tables
B. Using faster CPUs and more RAM on the database server
C. Pre-computed materialized read models — denormalized, aggregated views stored specifically for each query shape
D. Caching all database queries in Redis for 5 minutes
Q3. A read model in CQRS becomes stale when?
A. An event is published but the projection hasn't processed it yet — this is an inherent eventual consistency lag between write and read models
B. The read model's cache TTL expires and needs refreshing
C. The application server restarts and loses in-memory state
D. The read model is never stale — it is always current
CQRS separates the concern of writing correct state from reading state efficiently. The write side uses a rich domain model with business rules. The read side uses flat, denormalized projections tailored to exact query shapes. The projection (event handler) listens to events from the write side and updates the read model asynchronously — this creates eventual consistency. For most UIs, a few milliseconds of lag is invisible to users. CQRS is most valuable when read and write models have fundamentally different shapes.
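A projection is ultimately just an event handler that maintains a denormalized row per query shape. A minimal in-memory sketch (event shapes and field names are assumptions for illustration; a real projection would write to a database table like order_summary):

```python
# Minimal CQRS projection: events from the write side update a
# denormalized read model keyed by order_id.
read_model = {}  # order_id -> denormalized order-summary row

def project(event):
    """Apply one event from the write side to the read model."""
    kind = event["type"]
    if kind == "OrderCreated":
        read_model[event["order_id"]] = {
            "customer": event["customer"],
            "items": 0,
            "total": 0.0,
            "status": "created",
        }
    elif kind == "ItemAdded":
        row = read_model[event["order_id"]]
        row["items"] += 1
        row["total"] += event["price"]
    elif kind == "OrderShipped":
        read_model[event["order_id"]]["status"] = "shipped"

for e in [
    {"type": "OrderCreated", "order_id": "o1", "customer": "ada"},
    {"type": "ItemAdded", "order_id": "o1", "price": 19.99},
    {"type": "ItemAdded", "order_id": "o1", "price": 5.00},
]:
    project(e)

print(round(read_model["o1"]["total"], 2))  # 24.99
```

Note that the query side never recomputes the total: GetOrderSummary is a single key lookup, because the projection pre-computed the aggregate as events arrived.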
Open Design Challenge
1
Design the read model schema for an order summary page. What fields does it pre-compute? How is it updated when an order item is added?
2
The projection consumer crashes while processing an OrderShipped event. The read model shows "Processing" but the order is actually shipped. How do you recover?
3
Design a CQRS system where the write model uses PostgreSQL (ACID) and the read model uses Elasticsearch (full-text search). How do you keep them synchronized?
Exercise 3 🔴 Medium ⏱ 25 min
Stream Processing Windows
A fraud detection system processes transaction events. Rule: "flag user if they make more than 10 transactions in any 60-second window." A naive approach counts all transactions since account creation — not a true sliding window. Tumbling windows miss bursts that span two buckets. Sliding windows catch every burst regardless of timing.
Window Types Comparison
Tumbling: [0-60s] [60-120s] [120-180s]
Burst at 55-65s: split across 2 windows — missed!
Sliding (60s, 1s step): [0-60] [1-61] [2-62]...
Burst at 55-65s: caught in window [5-65] ✓
Session: gaps define boundaries
Concept Check — 3 questions
Q1. A tumbling window for 60 seconds means?
A. Overlapping windows that shift by 1 second, each spanning 60 seconds
B. Fixed, non-overlapping 60-second buckets: [0-60s], [60-120s], [120-180s] — a burst spanning two buckets may go undetected
C. Windows defined dynamically by gaps in user activity (session-based)
D. An infinite accumulation window that never resets
Q2. A sliding window with 60s size and 1s slide counts transactions in the preceding 60s for each second. This is more accurate for fraud detection because?
A. It is simpler to implement than tumbling windows
B. It uses significantly less memory than tumbling windows
C. It detects bursts at any point in time — there is no "safe boundary" that a fraudster can exploit to split a burst across two windows
D. It requires less CPU processing power than tumbling windows
Q3. Apache Flink vs Spark Streaming for fraud detection at 1M events/sec?
A. Spark — it has better streaming performance than Flink at high throughput
B. Flink — native streaming engine with event-time processing and sub-second latency; Spark Streaming uses micro-batches with higher latency
C. They perform identically for real-time fraud detection workloads
D. Neither — use Redis sorted sets only for any real-time window counting
Window types: Tumbling = non-overlapping fixed buckets. Sliding = overlapping windows that shift by step size. Session = activity-based windows bounded by inactivity gaps. For fraud: sliding windows are ideal but memory-intensive (maintain state for each overlapping window). Flink advantage: it processes events as they arrive (true streaming) with millisecond latency and native event-time semantics for out-of-order events. Spark Streaming uses micro-batches (collect events → process batch → repeat) with 0.5-2s minimum latency — acceptable for many use cases but not sub-second fraud detection.
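The sliding-window rule can be sketched with a per-user deque of timestamps: append the new event, evict anything older than 60 seconds, and check the count. This is an in-memory illustration of the technique; in a real deployment this state would live in the stream processor (e.g. Flink keyed state), not in application memory:

```python
# Per-user sliding-window counter for the rule
# "flag user if they make more than 10 transactions in any 60-second window".
from collections import defaultdict, deque

WINDOW_SECONDS = 60
THRESHOLD = 10

windows = defaultdict(deque)  # user_id -> timestamps inside the current window

def on_transaction(user_id, ts):
    """Record a transaction; return True if it pushes the user over the limit."""
    q = windows[user_id]
    q.append(ts)
    # Evict timestamps that have fallen out of the trailing 60s window.
    while q and ts - q[0] > WINDOW_SECONDS:
        q.popleft()
    return len(q) > THRESHOLD

# A burst of 11 transactions between t=55s and t=65s trips the rule,
# even though a tumbling [0-60s]/[60-120s] bucketing would split it.
flags = [on_transaction("u1", 55 + i) for i in range(11)]
print(flags[-1])  # True
```

This is also why sliding windows are memory-intensive: the state per user grows with the event rate inside the window, whereas a tumbling window needs only one counter per bucket.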
Open Design Challenge
1
Design a Flink job for fraud detection: "flag if user makes >10 transactions in 60 seconds OR total amount >$5000 in 5 minutes." What window types do you use for each rule?
2
Events arrive out-of-order in Kafka (network delays mean event at t=5s arrives after event at t=7s). How does Flink's event-time processing handle this? What is the watermark?
3
Design the alert pipeline: when fraud is detected, the system must block the transaction in <100ms. How do you connect the Flink job to the payment authorization service?
Exercise 4 🔥 Hard ⏱ 30 min
Event Replay and Read Model Rebuilding
An e-commerce platform has used event sourcing for 2 years. New feature needed: total lifetime spend per customer. The read model doesn't exist yet — they must replay 500 million historical events to build it. The system must continue serving live traffic during the rebuild, which could take hours.
Read Model Rebuild During Live Traffic
Event log (500M events) → replay pipeline → New projection (lifetime_spend per customer) → catches up → Switch to live (start serving queries)
Concept Check — 3 questions
Q1. Building a new read model from historical events requires?
A. A full database migration with ALTER TABLE and data backfill statements
B. Replaying the full event log from the beginning through the new projection — typically done offline, then switching to consuming live events
C. Rewriting all historical events to match the new read model's schema
D. Restoring from the latest database backup and transforming it
Q2. During read model rebuild, how do you serve the new "lifetime spend" feature?
A. Block all reads on the lifetime spend feature until the rebuild completes
B. Return partial (incorrect) data to users while the replay is in progress
C. Return a "building" status or null response while replay is in progress — communicate the expected completion time to clients
D. Use the old read model as a placeholder, even though it lacks lifetime spend data
Q3. Event schema evolution: the OrderCreated event gained a new tax_amount field 1 year ago. Events before that have no tax_amount. During replay, how do you handle old events?
A. Reject old events that don't have the required tax_amount field
B. Upcasting — transform old events during replay to add default values for new fields (tax_amount=0 for pre-tax-era events)
C. Delete all old events and ask customers to re-submit their orders
D. Run a manual schema migration to backfill tax_amount on all historical events
Read model rebuild process: 1) Start the new projection consumer at offset 0 of the event log. 2) Process all historical events — this may take hours for 500M events. 3) When the consumer reaches "now" (current event log position), switch to live mode. 4) Only then expose the new read model to queries. Upcasting: event schema versioning requires an "upcaster" — a transformation layer that upgrades old event formats to the current schema before they reach the projection. Keep backward compatibility: add fields with defaults, never remove or rename fields.
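An upcaster chain can be sketched as a map from version number to a pure transformation, applied repeatedly until the event reaches the latest schema. The version numbers and default values below follow the exercise; the function names and event dict shape are assumptions for illustration:

```python
# Versioned upcaster chain for OrderCreated:
# v1 -> v2 adds tax_amount, v2 -> v3 adds shipping_cost.

def v1_to_v2(event):
    event = dict(event, version=2)
    event.setdefault("tax_amount", 0.0)  # pre-tax-era events default to 0
    return event

def v2_to_v3(event):
    event = dict(event, version=3)
    event.setdefault("shipping_cost", 0.0)
    return event

UPCASTERS = {1: v1_to_v2, 2: v2_to_v3}  # from-version -> transformation

def upcast(event):
    """Upgrade an event step by step until it reaches the latest schema."""
    while event["version"] in UPCASTERS:
        event = UPCASTERS[event["version"]](event)
    return event

old = {"version": 1, "order_id": "o1", "total": 50.0}
up = upcast(old)
print(up["version"], up["tax_amount"], up["shipping_cost"])  # 3 0.0 0.0
```

Each upcaster only needs to know about the next version, so adding a v4 means writing one new v3_to_v4 step; the projection itself only ever sees the latest schema.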
Open Design Challenge
1
Design the replay pipeline for 500M events. How do you parallelize it across multiple workers? How do you ensure they don't process the same event twice?
2
The replay takes 6 hours. During this time, new OrderCreated events arrive. How do you ensure the new read model includes events from the replay AND events that arrived during rebuild?
3
Design a versioned upcaster system for the OrderCreated event with 3 schema versions (v1: no tax, v2: tax added, v3: tax + shipping_cost). How does the upcaster chain work?