System Design: Event-Driven Architecture — Kafka, Event Sourcing, CQRS, and Saga Pattern

What Is Event-Driven Architecture?

Event-Driven Architecture (EDA) decouples services by having them communicate through events rather than direct API calls. A producer publishes an event when something happens; consumers subscribe and react independently. This enables loose coupling, independent scalability, and natural audit trails.

Core Components

Event broker: Kafka, RabbitMQ, or AWS SNS/SQS. Kafka is the dominant choice for high-throughput, durable event streaming. Events are persisted on disk, replayed on demand, and consumed by multiple independent consumer groups.

Event schema: Define schemas with Avro or Protobuf, enforce via a Schema Registry. Versioning: use backward-compatible schema evolution (add optional fields, never remove required ones).
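The compatibility rule stated above can be sketched as a small check of the kind a schema registry performs. This is a simplified illustration, not Avro or Protobuf semantics: field definitions are plain dicts, and `is_backward_compatible` is a hypothetical helper.

```python
# Sketch of the evolution rule above: a new schema version may add optional
# fields (ones with defaults) but must never remove required fields.
# Field shapes are simplified stand-ins for Avro/Protobuf definitions.
def is_backward_compatible(old: dict, new: dict) -> bool:
    for name, spec in old.items():
        if spec["required"] and name not in new:
            return False  # removed a required field: breaking change
    for name, spec in new.items():
        if name not in old and spec["required"]:
            return False  # added a field without a default: breaking change
    return True

v1 = {"order_id": {"required": True}, "amount": {"required": True}}
v2 = {**v1, "coupon": {"required": False}}  # added optional field: OK
v3 = {"order_id": {"required": True}}       # dropped required 'amount': rejected
assert is_backward_compatible(v1, v2)
assert not is_backward_compatible(v1, v3)
```

A real Schema Registry runs this kind of check on every schema registration and rejects incompatible versions before any producer can publish with them.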

Event types: Domain events (OrderPlaced, PaymentProcessed), integration events (cross-service), and commands (RequestPayment — directed to a specific service).

Kafka Architecture

Topics are partitioned for parallelism. Each partition is an ordered, append-only log. Consumers in the same consumer group split partitions among themselves; each partition is consumed by exactly one consumer per group, so the partition count is the ceiling on a group's parallelism, and any consumers beyond that count sit idle. The retention period (default 7 days) allows replay and catch-up.
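Ordering per key follows from how the producer picks a partition. Kafka's default partitioner hashes the message key (murmur2) modulo the partition count; the sketch below substitutes an MD5-based hash to illustrate the idea, not the exact algorithm.

```python
# Sketch: keyed messages map deterministically to partitions, so all events
# for one key (e.g., one order) land on the same ordered log.
# This uses MD5 for determinism; Kafka's default partitioner uses murmur2.
import hashlib

NUM_PARTITIONS = 6

def partition_for(key: str, num_partitions: int = NUM_PARTITIONS) -> int:
    """Map a message key to a partition deterministically."""
    digest = hashlib.md5(key.encode()).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions

# Every event keyed by the same order ID hits the same partition,
# so that order's events are consumed in sequence.
assert partition_for("order-1234") == partition_for("order-1234")
```

This is also why changing the partition count of a live topic breaks key-to-partition affinity: the modulus changes, so new events for an existing key can land on a different partition.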

Offset management: consumers commit offsets after processing. At-least-once delivery is the default, so consumers must be idempotent to handle duplicate events. For exactly-once semantics within Kafka, use transactions: the producer writes its output and commits the consumer's offsets in a single atomic read-process-write transaction.
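An idempotent consumer can be sketched in a few lines. The in-memory set stands in for a durable dedup store (a processed_events table or Redis SET NX in production); the event shape is illustrative.

```python
# Sketch of an idempotent consumer: at-least-once delivery means the same
# event can arrive twice (e.g., after a crash before the offset commit).
# Checking a processed-IDs store before applying the side effect makes
# redelivery harmless.
processed_ids: set[str] = set()
balances: dict[str, int] = {}

def handle(event: dict) -> None:
    if event["event_id"] in processed_ids:
        return  # duplicate delivery: skip the side effect
    balances[event["account"]] = balances.get(event["account"], 0) + event["amount"]
    processed_ids.add(event["event_id"])

# The broker redelivers the same event after a simulated crash:
evt = {"event_id": "e-1", "account": "a1", "amount": 100}
handle(evt)
handle(evt)  # duplicate: ignored
assert balances["a1"] == 100
```

In a real consumer, the dedup-store write and the business side effect should commit in the same transaction, otherwise a crash between them reintroduces the duplicate problem.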

Event Sourcing

Instead of storing current state, store every state-changing event as an immutable record. Current state is derived by replaying events. Benefits: full audit trail, temporal queries (“what was the account balance on date X?”), natural event publishing. Drawbacks: query complexity (no simple SELECT), event schema evolution is hard, replay can be slow for long histories (mitigated by snapshots).

Implementation: append events to an event store (EventStoreDB, Kafka, PostgreSQL append-only table). Derive projections (read models) by consuming the event stream.
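Replay and snapshots can be sketched as a fold over the event log. The account aggregate, event names, and fields below are illustrative, not a prescribed schema.

```python
# Sketch of event sourcing: state is never stored directly; it is rebuilt by
# folding the event log. A snapshot (a saved state plus its version) caps
# replay cost for long histories.
from dataclasses import dataclass

@dataclass
class Account:
    balance: int = 0
    version: int = 0  # index of the last applied event

def apply(state: Account, event: dict) -> Account:
    if event["type"] == "Deposited":
        state.balance += event["amount"]
    elif event["type"] == "Withdrawn":
        state.balance -= event["amount"]
    state.version += 1
    return state

def replay(events, snapshot=None) -> Account:
    """Rebuild state from the log, skipping events the snapshot already covers."""
    state = snapshot or Account()
    for event in events[state.version:]:
        state = apply(state, event)
    return state

log = [
    {"type": "Deposited", "amount": 500},
    {"type": "Withdrawn", "amount": 200},
    {"type": "Deposited", "amount": 50},
]
assert replay(log).balance == 350
# Temporal query: what was the balance after the first two events?
assert replay(log[:2]).balance == 300
```

The same `replay` over a truncated log answers "what was the state on date X?", which is the temporal-query benefit mentioned above.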

CQRS (Command Query Responsibility Segregation)

Separate the write model (Commands) from the read model (Queries). Commands mutate state and emit events. Events update one or more read-optimized projections (denormalized tables, Elasticsearch indexes, Redis caches). Queries hit the read model — no joins, optimized for display. Trade-off: eventual consistency between write and read models (usually milliseconds to seconds).

When to use: high read/write ratio with different read/write shapes (e.g., write one order, read many aggregated dashboards).
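The write/read split can be sketched with a command handler that emits an event and a projector that maintains a denormalized view. The list standing in for the broker, and names like `PlaceOrder` and `orders_by_customer`, are illustrative.

```python
# Minimal CQRS sketch: the write side emits domain events; the read side
# consumes them into a display-optimized projection. In production the
# projector is an async consumer, which is where eventual consistency
# between the two models comes from.
events: list[dict] = []                         # stands in for the event broker
orders_by_customer: dict[str, list[str]] = {}   # read model (projection)

def handle_place_order(customer: str, order_id: str) -> None:
    """Write side: validate the command, then emit a domain event."""
    events.append({"type": "OrderPlaced", "customer": customer, "order_id": order_id})

def project(event: dict) -> None:
    """Read side: keep the denormalized view up to date."""
    if event["type"] == "OrderPlaced":
        orders_by_customer.setdefault(event["customer"], []).append(event["order_id"])

handle_place_order("cust-1", "ord-1")
handle_place_order("cust-1", "ord-2")
for e in events:
    project(e)
assert orders_by_customer["cust-1"] == ["ord-1", "ord-2"]
```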

Saga Pattern for Distributed Transactions

A saga is a sequence of local transactions, each publishing an event that triggers the next step. If a step fails, compensating transactions undo the previous steps. Two implementations:

Choreography: each service listens for events and publishes its own events. No central coordinator. Simple for short flows; hard to trace and debug for long ones.

Orchestration: a Saga Orchestrator sends commands to each service and handles failures centrally. Easier to reason about and monitor. Used by Uber, Netflix for complex workflows.

Example — Order Saga: (1) OrderCreated → reserve inventory. (2) InventoryReserved → charge payment. (3) PaymentCharged → confirm order. If payment fails → compensate: release inventory → cancel order.
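The orchestrated variant of the flow above can be sketched as a loop that pairs each step with its compensating action and, on failure, undoes completed steps in reverse. Step names and the simulated payment failure are illustrative.

```python
# Sketch of a saga orchestrator: run each local transaction; if one fails,
# execute the compensating transactions of the completed steps in reverse.
def run_saga(steps):
    done = []
    for action, compensate in steps:
        try:
            action()
        except Exception:
            for undo in reversed(done):  # compensate in reverse order
                undo()
            return "rolled_back"
        done.append(compensate)
    return "committed"

trace = []

def reserve_inventory(): trace.append("inventory reserved")
def release_inventory(): trace.append("inventory released")
def charge_payment():    raise RuntimeError("card declined")  # simulated failure
def refund_payment():    trace.append("payment refunded")

result = run_saga([(reserve_inventory, release_inventory),
                   (charge_payment, refund_payment)])
assert result == "rolled_back"
assert trace == ["inventory reserved", "inventory released"]
```

Note that payment is never refunded here: only steps that completed get compensated, which is exactly the "release inventory → cancel order" path in the example above.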

Outbox Pattern

Atomically persist state changes and publish events. The service writes to the database AND an outbox table in the same transaction. A separate relay process reads the outbox and publishes to Kafka. Solves the dual-write problem: without this, a service might update the DB but crash before publishing the event (or vice versa). The relay can use polling or PostgreSQL LISTEN/NOTIFY for low-latency delivery.
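The pattern can be sketched with SQLite (standing in for the service's database) and a callback standing in for the Kafka producer. Table and column names are illustrative.

```python
# Sketch of the outbox pattern: the state change and the outbox row commit in
# ONE transaction; a relay later reads unpublished rows, publishes them, and
# marks them published. SQLite stands in for the service DB.
import json, sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE orders (id TEXT PRIMARY KEY, status TEXT);
    CREATE TABLE outbox (id INTEGER PRIMARY KEY AUTOINCREMENT,
                         payload TEXT, published INTEGER DEFAULT 0);
""")

def place_order(order_id: str) -> None:
    with db:  # single atomic transaction: state change + outbox row
        db.execute("INSERT INTO orders VALUES (?, 'PLACED')", (order_id,))
        db.execute("INSERT INTO outbox (payload) VALUES (?)",
                   (json.dumps({"type": "OrderPlaced", "order_id": order_id}),))

def relay(publish) -> int:
    """Poll unpublished rows, hand them to the broker, mark them published."""
    rows = db.execute("SELECT id, payload FROM outbox WHERE published = 0").fetchall()
    for row_id, payload in rows:
        publish(json.loads(payload))  # at-least-once: a crash here means a retry
        db.execute("UPDATE outbox SET published = 1 WHERE id = ?", (row_id,))
    db.commit()
    return len(rows)

sent = []
place_order("ord-42")
assert relay(sent.append) == 1       # one event delivered to the "broker"
assert relay(sent.append) == 0       # nothing left to publish
```

Because the relay publishes before marking the row, a crash between the two steps causes a re-publish, which is why downstream consumers must be idempotent.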

System Design Interview Tips

  • Justify EDA: mention loose coupling, independent scaling, audit trails — not just “it uses Kafka”
  • Address eventual consistency: how do consumers handle out-of-order events? (sequence numbers, idempotency keys)
  • Dead letter queue: what happens to events that repeatedly fail processing?
  • Schema evolution: how do you change event schemas without breaking consumers?
  • Monitoring: consumer lag (how far behind is a consumer group?) is the key EDA metric
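Consumer lag from the last tip is simple arithmetic per partition. The offsets below are made-up numbers for illustration.

```python
# Sketch: consumer lag = (partition's log-end offset) - (group's committed
# offset), summed over partitions. Tools like kafka-consumer-groups report
# exactly this.
latest = {0: 1500, 1: 980, 2: 2210}      # log-end offset per partition
committed = {0: 1500, 1: 700, 2: 2100}   # consumer group's committed offsets

lag = {p: latest[p] - committed.get(p, 0) for p in latest}
total_lag = sum(lag.values())
assert lag == {0: 0, 1: 280, 2: 110}
assert total_lag == 390
```

A steadily growing total lag means consumers cannot keep up with producers; lag pinned to one partition usually means a hot key or a stuck consumer.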

Frequently Asked Questions

What is the difference between event-driven and request-driven architecture?

In request-driven (synchronous) architecture, Service A calls Service B directly via HTTP/gRPC and waits for the response. This creates temporal coupling: if B is down, A fails. In event-driven (asynchronous) architecture, A publishes an event to a broker and returns immediately; B consumes the event when ready, so services are decoupled in time and deployment. Request-driven is simpler for short workflows and when you need an immediate response. Event-driven is better when multiple services need to react to one event, you need to scale producers and consumers independently, you want natural audit trails, or the consumer can process asynchronously.

What is the Outbox pattern and why is it needed?

The Outbox pattern solves the dual-write problem. Without it, a service might write to the database and then crash before publishing the event, or publish the event but fail to commit the DB transaction. The fix: write the event to an outbox table in the same DB transaction as the state change. A separate relay process (Debezium CDC or a polling job) reads committed-but-unpublished outbox records, publishes them to Kafka, and marks them as published. This guarantees at-least-once event delivery with no data loss. The relay can safely retry: Kafka producers can be made idempotent, and consumers handle duplicates.

What is Event Sourcing and when should you use it?

Event Sourcing stores every state change as an immutable event (OrderPlaced, ItemAdded, PaymentCharged) instead of just the current state; current state is derived by replaying events. Benefits: full audit trail, temporal queries, natural event publishing for EDA. Drawbacks: query complexity (projections must be built and maintained), event schema changes require migration strategies, and long event histories need periodic snapshots to avoid slow replay. Use Event Sourcing when you need a complete audit trail (finance, compliance), want to replay historical state for debugging, or your domain is naturally event-based. Avoid it for CRUD-heavy systems with no audit requirements.

How does the Saga pattern handle distributed transactions?

A saga breaks a distributed transaction into a sequence of local transactions, each with a compensating transaction for rollback. Choreography-based: each service listens for an event and publishes the next one; there is no central coordinator. Simple, but hard to trace. Orchestration-based: a Saga Orchestrator sends commands to each service and handles failure by issuing compensating commands. Easier to monitor and debug. Example: an Order Saga reserves inventory, charges payment, then ships the order; if payment fails, it releases the inventory (compensating) and cancels the order. Sagas do not provide isolation (other transactions can see intermediate state), a known trade-off compared to ACID transactions.

How do you handle consumer failures and ensure at-least-once processing in Kafka?

Kafka consumers commit their offset after successfully processing a message. If a consumer crashes mid-processing, the message is reprocessed when the consumer restarts (or another consumer in the group takes over the partition). This gives at-least-once delivery. To handle duplicates, make consumers idempotent: check whether the event was already processed, using the event ID in a processed_events table or Redis SET NX. For exactly-once, use Kafka transactions, in which the producer writes the output and the consumer offset commit happens in a single atomic transaction. Monitor consumer lag (the difference between the latest offset and the committed offset) to detect slow or stalled consumers.
