What Is Event-Driven Architecture?
Event-Driven Architecture (EDA) decouples services by having them communicate through events rather than direct API calls. A producer publishes an event when something happens; consumers subscribe and react independently. This enables loose coupling, independent scalability, and natural audit trails.
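The decoupling can be seen in a minimal in-memory sketch (an illustrative stand-in for a real broker — `EventBus`, `subscribe`, and `publish` are names invented here, not a library API): the producer publishes without knowing who, or how many, consumers react.

```python
from collections import defaultdict

class EventBus:
    """Minimal in-memory event bus, for illustration only."""
    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, event_type, handler):
        self._subscribers[event_type].append(handler)

    def publish(self, event_type, payload):
        # The producer knows nothing about its consumers.
        for handler in self._subscribers[event_type]:
            handler(payload)

bus = EventBus()
received = []
# Two independent consumers react to the same event.
bus.subscribe("OrderPlaced", lambda e: received.append(("email", e["order_id"])))
bus.subscribe("OrderPlaced", lambda e: received.append(("inventory", e["order_id"])))
bus.publish("OrderPlaced", {"order_id": 42})
```

Adding a third consumer requires no change to the producer — that is the loose coupling in practice.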
Core Components
Event broker: Kafka, RabbitMQ, or AWS SNS/SQS. Kafka is the dominant choice for high-throughput, durable event streaming. Events are persisted on disk, replayed on demand, and consumed by multiple independent consumer groups.
Event schema: Define schemas with Avro or Protobuf, enforce via a Schema Registry. Versioning: use backward-compatible schema evolution (add optional fields, never remove required ones).
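Backward-compatible evolution can be sketched with Avro-style schemas expressed as plain Python dicts (the schemas and the `read_with_schema` helper are illustrative, not a Schema Registry API): v2 adds an optional field with a default and removes nothing, so a v2 reader can still consume v1 records.

```python
# Avro-style record schemas as Python dicts, for illustration.
ORDER_V1 = {
    "type": "record", "name": "OrderPlaced",
    "fields": [
        {"name": "order_id", "type": "long"},
        {"name": "amount", "type": "double"},
    ],
}
# Backward-compatible v2: a new optional field with a default; no field removed.
ORDER_V2 = {
    "type": "record", "name": "OrderPlaced",
    "fields": ORDER_V1["fields"] + [
        {"name": "currency", "type": ["null", "string"], "default": None},
    ],
}

def read_with_schema(record, schema):
    """Fill in defaults so a new reader can consume an old record."""
    out = dict(record)
    for field in schema["fields"]:
        if field["name"] not in out:
            if "default" not in field:
                raise ValueError(f"missing required field {field['name']}")
            out[field["name"]] = field["default"]
    return out

old_record = {"order_id": 1, "amount": 9.99}       # written under v1
upgraded = read_with_schema(old_record, ORDER_V2)  # read under v2
```

Removing a required field would make `read_with_schema` raise on old records — exactly the breakage backward compatibility rules prevent.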
Event types: Domain events (OrderPlaced, PaymentProcessed), integration events (cross-service), and commands (RequestPayment — directed to a specific service).
Kafka Architecture
Topics are partitioned for parallelism. Each partition is an ordered, append-only log. Consumers in the same consumer group split partitions among themselves — each partition is consumed by exactly one consumer per group, so consumer parallelism within a group is capped by the partition count (extra consumers sit idle). The retention period (default 7 days) allows replay and catch-up.
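Key-based partitioning is what preserves per-key ordering: a sketch of the idea (Kafka's default partitioner hashes the key with murmur2; MD5 is used here purely for illustration).

```python
import hashlib

NUM_PARTITIONS = 6

def partition_for(key: str, num_partitions: int = NUM_PARTITIONS) -> int:
    # Kafka's default partitioner uses murmur2; md5 here is illustrative only.
    digest = hashlib.md5(key.encode()).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions

# Every event with the same key lands on the same partition,
# so ordering is guaranteed per key (not across the whole topic).
p = partition_for("user-123")
```

This is also why partition count is a capacity-planning decision: repartitioning later changes key-to-partition mapping and breaks per-key ordering during the transition.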
Offset management: consumers commit offsets after processing. At-least-once delivery is the default, so consumers must be idempotent to tolerate duplicate events. Kafka transactions provide exactly-once processing for read-process-write pipelines within Kafka (idempotent producer plus transactional offset commits); end-to-end exactly-once still requires idempotent external sinks.
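An idempotent consumer can be sketched by deduplicating on an event ID before applying the effect (the in-memory `processed_ids` set stands in for what would be a durable store in production):

```python
processed_ids = set()   # in production: a durable table, not in-memory state
balance = 0

def handle(event):
    """Apply an event at most once, even under at-least-once redelivery."""
    global balance
    if event["event_id"] in processed_ids:
        return  # duplicate redelivery: skip
    processed_ids.add(event["event_id"])
    balance += event["amount"]
    # The offset commit would happen only after this point.

for ev in [{"event_id": "e1", "amount": 100},
           {"event_id": "e1", "amount": 100},  # redelivered duplicate
           {"event_id": "e2", "amount": 50}]:
    handle(ev)
```

Without the dedup check, the redelivered `e1` would be applied twice and the balance would be wrong.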
Event Sourcing
Instead of storing current state, store every state-changing event as an immutable record. Current state is derived by replaying events. Benefits: full audit trail, temporal queries (“what was the account balance on date X?”), natural event publishing. Drawbacks: query complexity (no simple SELECT), event schema evolution is hard, replay can be slow for long histories (mitigated by snapshots).
Implementation: append events to an event store (EventStoreDB, Kafka, PostgreSQL append-only table). Derive projections (read models) by consuming the event stream.
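The core mechanic — state as a fold over the event history — fits in a few lines (an illustrative in-memory event store; `append` and `replay` are names invented for the sketch):

```python
events = []  # append-only event store (illustrative)

def append(event):
    events.append(event)

def replay(upto=None):
    """Derive current state by folding over the event history."""
    balance = 0
    for e in events[:upto]:
        if e["type"] == "Deposited":
            balance += e["amount"]
        elif e["type"] == "Withdrawn":
            balance -= e["amount"]
    return balance

append({"type": "Deposited", "amount": 100})
append({"type": "Withdrawn", "amount": 30})
append({"type": "Deposited", "amount": 10})
```

Note that `replay(upto=2)` answers the temporal query "what was the balance after the first two events?" — and a snapshot is just a cached `replay` result plus the offset it covers.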
CQRS (Command Query Responsibility Segregation)
Separate the write model (Commands) from the read model (Queries). Commands mutate state and emit events. Events update one or more read-optimized projections (denormalized tables, Elasticsearch indexes, Redis caches). Queries hit the read model — no joins, optimized for display. Trade-off: eventual consistency between write and read models (usually milliseconds to seconds).
When to use: high read/write ratio with different read/write shapes (e.g., write one order, read many aggregated dashboards).
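The split can be sketched end to end — a command handler that validates and emits events, and a denormalized projection updated from the stream (all names here are illustrative):

```python
# Write side: commands validate, then emit events.
events = []

def place_order(order_id, amount):
    if amount <= 0:
        raise ValueError("invalid amount")
    events.append({"type": "OrderPlaced", "order_id": order_id, "amount": amount})

# Read side: a denormalized projection, optimized for display (no joins).
dashboard = {"order_count": 0, "revenue": 0.0}

def project(event):
    if event["type"] == "OrderPlaced":
        dashboard["order_count"] += 1
        dashboard["revenue"] += event["amount"]

place_order("o1", 25.0)
place_order("o2", 75.0)
for e in events:    # in production this consumption is asynchronous —
    project(e)      # hence the eventual-consistency window
```

The gap between `place_order` returning and `project` running is exactly the milliseconds-to-seconds staleness the pattern trades for fast reads.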
Saga Pattern for Distributed Transactions
A saga is a sequence of local transactions, each publishing an event that triggers the next step. If a step fails, compensating transactions undo the previous steps. Two implementations:
Choreography: each service listens for events and publishes its own events. No central coordinator. Simple for short flows; hard to trace and debug for long ones.
Orchestration: a Saga Orchestrator sends commands to each service and handles failures centrally. Easier to reason about and monitor. Used by Uber, Netflix for complex workflows.
Example — Order Saga: (1) OrderCreated → reserve inventory. (2) InventoryReserved → charge payment. (3) PaymentCharged → confirm order. If payment fails → compensate: release inventory → cancel order.
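The orchestration variant of that order saga can be sketched as a loop that runs each step and, on failure, replays the compensations in reverse (the structure and names are illustrative, not a framework API):

```python
def run_saga(steps):
    """Run (action, compensation) pairs; on failure, compensate in reverse."""
    completed = []
    for action, compensate in steps:
        try:
            action()
            completed.append(compensate)
        except Exception:
            for undo in reversed(completed):
                undo()
            return "compensated"
    return "completed"

log = []

def step(name):
    return lambda: log.append(name)

def failing(name):
    def action():
        raise RuntimeError(name)
    return action

order_saga = [
    (step("reserve_inventory"), step("release_inventory")),
    (failing("charge_payment"), step("refund_payment")),
    (step("confirm_order"),     step("cancel_order")),
]
outcome = run_saga(order_saga)
```

Here payment fails, so only the inventory reservation is compensated — the payment and confirmation compensations never run because those steps never completed.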
Outbox Pattern
Make the state change and the event publication effectively atomic. The service writes to its business tables AND an outbox table in the same database transaction. A separate relay process reads the outbox and publishes to Kafka. This solves the dual-write problem: without it, a service might update the DB but crash before publishing the event (or vice versa). The relay can use polling or PostgreSQL LISTEN/NOTIFY for low-latency delivery.
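A minimal sketch using SQLite's standard-library driver (the table names, the `published` flag, and the in-memory `published` list standing in for a Kafka producer are all illustrative):

```python
import json
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE orders (id TEXT PRIMARY KEY, status TEXT)")
db.execute("""CREATE TABLE outbox
              (id INTEGER PRIMARY KEY, topic TEXT, payload TEXT,
               published INTEGER DEFAULT 0)""")

def create_order(order_id):
    # State change and outbox row commit in ONE transaction: no dual write.
    with db:
        db.execute("INSERT INTO orders VALUES (?, ?)", (order_id, "created"))
        db.execute(
            "INSERT INTO outbox (topic, payload) VALUES (?, ?)",
            ("orders", json.dumps({"type": "OrderCreated", "order_id": order_id})),
        )

published = []  # stand-in for the Kafka producer

def relay_once():
    # Polling relay: publish unpublished rows, then mark them sent.
    rows = db.execute(
        "SELECT id, topic, payload FROM outbox WHERE published = 0"
    ).fetchall()
    for row_id, topic, payload in rows:
        published.append((topic, json.loads(payload)))
        db.execute("UPDATE outbox SET published = 1 WHERE id = ?", (row_id,))
    db.commit()

create_order("o-1")
relay_once()
```

If the process crashes between the transaction and the relay run, the outbox row survives and the relay publishes it on the next poll — delivery becomes at-least-once, which pairs with the idempotent consumers discussed above.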
System Design Interview Tips
- Justify EDA: mention loose coupling, independent scaling, audit trails — not just “it uses Kafka”
- Address eventual consistency: how do consumers handle out-of-order events? (sequence numbers, idempotency keys)
- Dead letter queue: what happens to events that repeatedly fail processing?
- Schema evolution: how do you change event schemas without breaking consumers?
- Monitoring: consumer lag (how far behind is a consumer group?) is the key EDA metric
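The dead-letter-queue point above is worth being able to sketch: retry a bounded number of times, then park the event with its error for offline inspection (`MAX_ATTEMPTS`, `consume_with_dlq`, and the list-backed DLQ are illustrative names, not a broker feature's API):

```python
MAX_ATTEMPTS = 3
dead_letter_queue = []

def consume_with_dlq(event, handler):
    """Retry a failing handler; park the event in the DLQ after MAX_ATTEMPTS."""
    last_error = None
    for attempt in range(1, MAX_ATTEMPTS + 1):
        try:
            return handler(event)
        except Exception as exc:
            last_error = str(exc)
    # Parking the event keeps the partition moving instead of blocking on a
    # poison message; an operator can inspect and replay the DLQ later.
    dead_letter_queue.append(
        {"event": event, "error": last_error, "attempts": MAX_ATTEMPTS}
    )

def always_fails(event):
    raise RuntimeError("downstream unavailable")

consume_with_dlq({"event_id": "e9"}, always_fails)
```

In an interview, also mention what happens next: alerting on DLQ depth, and a replay path once the downstream issue is fixed.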
Asked at: Databricks, Twitter/X, Netflix, Cloudflare, Atlassian