What Is Apache Kafka?
Apache Kafka is a distributed event streaming platform built for high-throughput, fault-tolerant, real-time data pipelines. Unlike traditional message queues (RabbitMQ, SQS), Kafka stores messages durably on disk and allows consumers to replay any offset — making it both a message queue and an event log.
Core Architecture
Topics, Partitions, Offsets
- Topic: logical stream of records (e.g., order-events). Append-only log.
- Partition: a topic is split into N partitions, each an ordered, immutable sequence of records. Partitions enable parallelism — producers write to different partitions simultaneously.
- Offset: integer position of a record within a partition. Each consumer tracks its own offset per partition (stored in the __consumer_offsets topic). Consumers can seek backward and replay.
- Key-based partitioning: records with the same key always land in the same partition (consistent hashing on the key), guaranteeing per-key ordering. Keyless records are distributed round-robin.
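Key-based routing can be sketched as hash-then-modulo. This toy version uses md5 as a stand-in for Kafka's actual murmur2 partitioner, purely for illustration:

```python
import hashlib

def partition_for_key(key: bytes, num_partitions: int) -> int:
    """Map a record key to a partition. Kafka's default partitioner
    uses murmur2; md5 here is a simplified stand-in."""
    digest = hashlib.md5(key).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions

# The same key always maps to the same partition,
# which is what preserves per-key ordering.
p1 = partition_for_key(b"user-42", 6)
p2 = partition_for_key(b"user-42", 6)
assert p1 == p2
```

Note the flip side: changing the partition count changes the key-to-partition mapping, which is why partitions are usually sized up front.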
Brokers and Replication
- A Kafka cluster has N brokers (3 is a standard minimum). Each partition has one leader and RF - 1 followers, where RF is the replication factor.
- Producers write to the leader; followers pull and replicate. Leader tracks which followers are in-sync (ISR — In-Sync Replicas).
- acks=0: fire and forget (no durability guarantee). acks=1: leader acknowledgment only (fast, small risk of loss on leader crash). acks=all: all in-sync replicas must acknowledge (strong durability, higher latency).
- If a leader crashes, the controller elects a new leader from the ISR. Unclean leader election (electing out-of-sync follower) can cause data loss — disabled by default.
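The acks trade-off interacts with the broker-side min.insync.replicas setting, which bounds how small the ISR may shrink before acks=all writes are rejected. A simplified decision-function sketch (not broker code):

```python
def write_acknowledged(acks: str, isr_acks: int, isr_size: int,
                       min_insync_replicas: int = 2) -> bool:
    """Model whether a produce request succeeds under each acks mode.
    isr_acks = replicas that acknowledged; isr_size = current ISR size."""
    if acks == "0":
        return True                      # fire and forget: never waits
    if acks == "1":
        return isr_acks >= 1             # leader write only
    if acks == "all":
        # every in-sync replica must ack, AND the ISR must still be
        # large enough to satisfy min.insync.replicas
        return isr_size >= min_insync_replicas and isr_acks == isr_size
    raise ValueError(f"unknown acks mode: {acks}")

assert write_acknowledged("all", isr_acks=3, isr_size=3)
assert not write_acknowledged("all", isr_acks=1, isr_size=1)  # ISR shrank
```

With acks=all and min.insync.replicas=2 on RF=3, the cluster tolerates one replica failure without losing availability or acknowledged data.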
Zookeeper vs. KRaft
Historically, Kafka required ZooKeeper for cluster metadata (leader election, broker registration). Kafka 3.x+ uses KRaft (Kafka Raft) — Kafka manages its own metadata via an internal Raft consensus log, eliminating the ZooKeeper dependency. KRaft supports millions of partitions and simplifies operations.
Producers
- Batching: producers accumulate records into batches (batch.size) before sending. Larger batches improve throughput (fewer network round-trips) at the cost of latency (linger.ms adds wait time to let batches fill).
- Compression: GZIP, Snappy, LZ4, ZSTD — applied at the batch level. LZ4 is fastest for most workloads; ZSTD gives the best ratio.
- Idempotent producer: set enable.idempotence=true. The broker deduplicates retried records using the producer ID + sequence number, preventing duplicates from network retries.
- Transactions: atomic writes across multiple partitions/topics. Begin transaction → produce → commit. Consumers read only committed records with isolation.level=read_committed.
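Broker-side idempotence amounts to deduplication on (producer ID, sequence number). This toy Broker class illustrates the idea; it is not the real protocol:

```python
class Broker:
    """Minimal sketch of broker-side idempotence: a retried batch
    carries the same (producer_id, sequence) and is dropped."""
    def __init__(self):
        self.log = []
        self.last_seq = {}   # producer_id -> highest sequence appended

    def append(self, producer_id: int, seq: int, record: str) -> bool:
        if self.last_seq.get(producer_id, -1) >= seq:
            return False     # already seen: drop the duplicate silently
        self.log.append(record)
        self.last_seq[producer_id] = seq
        return True

b = Broker()
b.append(42, 0, "order-1")
b.append(42, 0, "order-1")   # network retry of the same batch
assert b.log == ["order-1"]  # exactly one copy in the log
```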
Consumers and Consumer Groups
- Consumer group: a set of consumers sharing a group.id. Kafka assigns each partition to exactly one consumer in the group. With 6 partitions and 3 consumers, each consumer reads 2 partitions.
- Rebalancing: when a consumer joins or leaves the group, Kafka reassigns partitions. Eager rebalance revokes all partitions first (causing a pause); cooperative (incremental) rebalance moves only the affected partitions.
- Offset commit: consumers periodically commit their current offset. Auto-commit (enable.auto.commit=true) risks processing the same record twice after a crash. Manual commit after processing ensures at-least-once delivery. Exactly-once requires transactional producers + consumers.
- Lag: the difference between the latest offset and the consumer's committed offset. High lag means the consumer is falling behind. Monitor with kafka-consumer-groups.sh --describe or Burrow.
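The assignment and lag arithmetic above can be sketched in a few lines. The even-spread assignor here is a simplification, not Kafka's actual range or sticky assignors:

```python
def assign(num_partitions: int, consumers: list) -> dict:
    """Spread partitions evenly; each partition goes to exactly
    one consumer in the group (simplified assignor sketch)."""
    out = {c: [] for c in consumers}
    for p in range(num_partitions):
        out[consumers[p % len(consumers)]].append(p)
    return out

# 6 partitions, 3 consumers -> 2 partitions each
assignment = assign(6, ["c1", "c2", "c3"])
assert all(len(parts) == 2 for parts in assignment.values())

def lag(latest_offset: int, committed_offset: int) -> int:
    """Consumer lag = log-end offset minus committed offset."""
    return latest_offset - committed_offset

assert lag(1_000, 850) == 150
```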
Retention and Compaction
- Time-based retention: delete records older than log.retention.hours (default 168h / 7 days).
- Size-based retention: delete the oldest segments when a partition exceeds log.retention.bytes.
- Log compaction: for topics configured with cleanup.policy=compact, Kafka retains only the latest record per key. Used for changelog topics (materialized views, database CDC). Deletes are expressed as tombstones (records with a null value).
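Compaction keeps the latest value per key, with tombstones deleting keys entirely; a minimal sketch of the end state a compacted topic converges to:

```python
def compact(log):
    """Reduce a (key, value) log to its compacted state:
    latest value per key, None acting as a tombstone."""
    state = {}
    for key, value in log:
        if value is None:
            state.pop(key, None)   # tombstone removes the key
        else:
            state[key] = value     # later value replaces earlier one
    return state

log = [("user-1", "alice"), ("user-2", "bob"),
       ("user-1", "alice_v2"), ("user-2", None)]
assert compact(log) == {"user-1": "alice_v2"}
```

This is exactly why compacted topics work as changelogs: replaying one from the beginning rebuilds the current materialized state.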
Kafka Streams and KSQL
- Kafka Streams: Java library for stream processing directly within Kafka. No separate cluster needed. Stateful operations (joins, aggregations) use RocksDB-backed state stores replicated to changelog topics.
- ksqlDB: SQL interface over Kafka Streams. Create streams and tables with SQL; push queries for real-time results.
- Flink/Spark: for heavy stateful processing (complex joins, windowed aggregations across large state), use Flink or Spark Structured Streaming reading from Kafka.
Common Design Patterns
- Event sourcing: every state change is an event. Kafka is the event log; services rebuild state by replaying. Pairs with CQRS — write path = Kafka producer, read path = consumer updating a read model.
- Outbox pattern: in a distributed transaction, write to DB + write to an outbox table atomically. A CDC tool (Debezium) reads the outbox and publishes to Kafka. Avoids dual-write inconsistency.
- Fan-out: one producer → multiple consumer groups, each processing independently (analytics, notifications, search indexing).
- Dead letter queue (DLQ): failed messages are sent to a separate DLQ topic after N retries. A separate consumer processes DLQ for alerting and manual reprocessing.
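The outbox pattern can be sketched with SQLite standing in for the service database. The table and column names are illustrative, and in production a CDC relay such as Debezium (not a polling query) publishes the outbox rows to Kafka:

```python
import json
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, total REAL)")
db.execute("CREATE TABLE outbox (id INTEGER PRIMARY KEY AUTOINCREMENT, "
           "payload TEXT)")

# The business write and the outbox row commit in ONE local
# transaction, so there is no dual-write inconsistency window.
with db:
    db.execute("INSERT INTO orders (id, total) VALUES (?, ?)", (1, 99.5))
    db.execute("INSERT INTO outbox (payload) VALUES (?)",
               (json.dumps({"event": "order-created", "order_id": 1}),))

# A relay reads the outbox in insertion order and publishes to Kafka.
events = [json.loads(row[0])
          for row in db.execute("SELECT payload FROM outbox ORDER BY id")]
assert events == [{"event": "order-created", "order_id": 1}]
```

If the transaction rolls back, neither the order nor the event exists, which is the whole point: the event stream can never disagree with the database.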
Throughput and Scaling
- Kafka handles millions of messages/second. LinkedIn’s cluster processes 7 trillion messages/day.
- Increase partitions to scale throughput (more parallelism). Rule: partitions ≥ peak_throughput / per_partition_throughput (typically 10–50 MB/s per partition depending on disk).
- Scale consumers: add consumers up to the number of partitions. Beyond that, consumers sit idle.
- Multi-region: MirrorMaker 2 replicates topics between clusters. Active-active is complex due to offset divergence — use different topic namespaces per region.
Interview Questions: Kafka
Q: Why is Kafka faster than traditional message queues?
Sequential disk I/O (append-only log) is faster than random I/O. Zero-copy: Kafka uses sendfile() system call to transfer data from disk to network socket without copying to userspace. Batching and compression reduce network overhead. Consumer pull model avoids push-related backpressure issues.
Q: How do you guarantee exactly-once delivery in Kafka?
Three requirements: (1) Idempotent producer (enable.idempotence=true) — deduplicates retried produces. (2) Transactional producer — atomically commits across partitions. (3) Consumer with isolation.level=read_committed — ignores uncommitted records. Together these form Kafka’s exactly-once semantics (EOS). Exactly-once end-to-end also requires idempotent consumers (e.g., database upsert with unique constraint).
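The idempotent-consumer half of end-to-end exactly-once can be sketched as a database upsert, so a redelivered record is a no-op. SQLite and the schema here are illustrative:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE balances (account TEXT PRIMARY KEY, amount REAL)")

def apply_record(account: str, amount: float) -> None:
    """Upsert keyed on account: processing the same Kafka record
    twice leaves exactly one row with the latest value."""
    db.execute("INSERT INTO balances (account, amount) VALUES (?, ?) "
               "ON CONFLICT(account) DO UPDATE SET amount = excluded.amount",
               (account, amount))

apply_record("acct-1", 10.0)
apply_record("acct-1", 10.0)   # duplicate delivery after a consumer crash
rows = db.execute("SELECT account, amount FROM balances").fetchall()
assert rows == [("acct-1", 10.0)]
```

The effect is exactly-once even though delivery is at-least-once: duplicates reach the consumer but cannot change the final state twice.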
Q: What happens when a consumer is slower than the producer?
Consumer lag grows. Kafka retains data per retention policy, so as long as lag doesn’t exceed retention, the consumer can catch up. If lag exceeds retention, messages are lost (for the consumer — data still existed, just expired). Solutions: increase consumer parallelism (add partitions + consumers), optimize consumer processing, or use a separate consumer group to process in priority order.
Asked at: LinkedIn, Uber, Netflix, Databricks, Cloudflare