System Design: Data Lake — Ingestion, Storage Layers, and Query Engine (2025)

Data Lake vs. Data Warehouse

A data warehouse stores structured, schema-on-write data optimized for SQL analytics (Redshift, BigQuery, Snowflake). A data lake stores raw data in any format (CSV, JSON, Parquet, images, logs) in cheap object storage (S3, GCS), with schema applied at query time (schema-on-read). The lakehouse architecture combines both: raw data in object storage with a transactional table format (Delta Lake, Apache Iceberg, Apache Hudi) that adds ACID transactions, schema enforcement, and efficient query performance on top of the data lake. Modern stacks often use a lakehouse for both analytical SQL queries and ML feature pipelines.
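To make schema-on-read concrete, here is a minimal PySpark sketch that reads raw JSON straight from object storage and lets Spark infer the schema at query time. The bucket path and the event_type field are illustrative assumptions, not part of the architecture described above.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("schema-on-read-demo").getOrCreate()

    # Raw JSON files landed as-is in object storage; no schema was enforced at
    # write time, so Spark infers one when the files are read (schema-on-read).
    raw_events = spark.read.json("s3://datalake/raw/events/")  # hypothetical path

    # Query the inferred structure with plain SQL.
    raw_events.createOrReplaceTempView("raw_events")
    spark.sql("""
        SELECT event_type, COUNT(*) AS cnt
        FROM raw_events
        GROUP BY event_type
    """).show()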

Ingestion Layer

Three ingestion patterns:

- Batch ingestion: nightly or hourly ETL jobs pull data from operational databases (via CDC with Debezium, or full extracts), transform it, and write to the lake. Tools: Apache Spark, AWS Glue, dbt. Partition by date: s3://datalake/events/year=2026/month=04/day=17/.
- Streaming ingestion: events flow through Kafka → Flink/Spark Streaming → the lake in micro-batches (every 1-5 minutes), enabling near-real-time analytics (see the sketch after this list). Run compaction jobs to merge the many small files this produces (the “small files problem”) into larger Parquet files for efficient reads.
- CDC (Change Data Capture): Debezium reads the database’s binary log (MySQL binlog, PostgreSQL WAL) and publishes every INSERT/UPDATE/DELETE as an event to Kafka. The lake consumer applies these changes to a lakehouse table, maintaining an up-to-date replica of the operational database for analytics without impacting production.
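The streaming path can be sketched with Spark Structured Streaming writing micro-batches from Kafka into a Delta table. This is a hedged example: the broker address, topic name, event schema, and bucket paths are assumptions, and the Delta sink requires the delta-spark package on the cluster.

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, from_json, to_date
    from pyspark.sql.types import StructType, StructField, StringType, TimestampType

    spark = SparkSession.builder.appName("events-ingest").getOrCreate()

    # Assumed shape of the JSON events published to Kafka.
    event_schema = StructType([
        StructField("user_id", StringType()),
        StructField("event_type", StringType()),
        StructField("event_ts", TimestampType()),
    ])

    kafka_stream = (spark.readStream
        .format("kafka")
        .option("kafka.bootstrap.servers", "kafka:9092")   # hypothetical broker
        .option("subscribe", "events")                      # hypothetical topic
        .load())

    events = (kafka_stream
        .select(from_json(col("value").cast("string"), event_schema).alias("e"))
        .select("e.*")
        .withColumn("event_date", to_date(col("event_ts"))))

    # Micro-batch every 5 minutes, partitioned by date to match the lake layout.
    # A periodic compaction job (e.g. Delta OPTIMIZE) later merges the resulting
    # small files into larger Parquet files.
    query = (events.writeStream
        .format("delta")
        .option("checkpointLocation", "s3://datalake/_checkpoints/events/")
        .partitionBy("event_date")
        .trigger(processingTime="5 minutes")
        .start("s3://datalake/events/"))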

Storage Format: Parquet and Delta Lake

Parquet is the standard columnar storage format for analytics: column-oriented (reads only the columns needed), compressed per column (SNAPPY or ZSTD), with column statistics (min, max, null count) in the metadata for predicate pushdown. A query like WHERE event_date = '2026-04-17' skips entire row groups whose event_date min/max statistics exclude that value (e.g., max < '2026-04-17'). Delta Lake adds on top:

- Transaction log: every write appends a JSON entry to the _delta_log/ directory describing the operation (add/remove files). Readers replay the log to construct the current table state.
- ACID transactions: concurrent writes use optimistic concurrency; two writes to the same table succeed if they modify different partitions and conflict otherwise.
- Time travel: query any previous version of the table: SELECT * FROM events VERSION AS OF 100 or TIMESTAMP AS OF '2026-04-01'.
- Schema evolution: add columns without rewriting existing files; old files return NULL for the new column.
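The sketch below exercises these features with PySpark on a Delta table: two writes create two versions in the transaction log, mergeSchema adds a column without rewriting old files, and time travel reads the earlier version. The path and column names are assumptions, and delta-spark must be installed.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("delta-demo").getOrCreate()
    path = "s3://datalake/events_demo/"   # hypothetical table path

    # Version 0: initial write creates the table and the _delta_log/ directory.
    spark.createDataFrame(
        [("u1", "click")], ["user_id", "event_type"]
    ).write.format("delta").mode("overwrite").save(path)

    # Version 1: append a batch with a new column; schema evolution means the
    # old files are not rewritten and simply return NULL for "platform".
    spark.createDataFrame(
        [("u2", "view", "ios")], ["user_id", "event_type", "platform"]
    ).write.format("delta").mode("append").option("mergeSchema", "true").save(path)

    # Time travel: read the table as of version 0 of the transaction log.
    spark.read.format("delta").option("versionAsOf", 0).load(path).show()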

Query Engine and Metastore

The metastore (Hive Metastore, AWS Glue Catalog) stores table schemas, partition locations, and statistics. Query engines (Trino, Spark SQL, Athena, BigQuery Omni) read the metastore to understand the table structure, then scan only the relevant Parquet files using partition pruning and predicate pushdown. Query optimization techniques:

- Partition pruning: partition the table by date; queries filtering on date skip all other partitions.
- Data skipping: Delta Lake stores min/max statistics per file; queries skip files that cannot contain matching rows.
- Z-ordering: co-locate related data (e.g., Z-order by user_id) so that queries filtering on user_id read fewer files.
- Caching: Alluxio or local SSD caches keep hot data near the compute nodes, reducing S3 reads for repeated queries.
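As a sketch of how these optimizations appear to a Spark SQL user on a Delta table: the date filter below touches only one partition, and OPTIMIZE with ZORDER BY rewrites files so per-file min/max statistics prune more files for user_id filters. The table path and columns are assumptions; the OPTIMIZE and ZORDER BY syntax is available in recent Delta Lake releases.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("query-demo").getOrCreate()

    # Partition pruning: only the partition for 2026-04-17 is scanned, because
    # the table is partitioned on event_date.
    spark.sql("""
        SELECT event_type, COUNT(*) AS cnt
        FROM delta.`s3://datalake/events/`
        WHERE event_date = '2026-04-17'
        GROUP BY event_type
    """).show()

    # Z-ordering: co-locate rows with similar user_id values so that data
    # skipping can eliminate most files for a user_id filter.
    spark.sql("OPTIMIZE delta.`s3://datalake/events/` ZORDER BY (user_id)")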

Data Governance and Lineage

Governance requirements at scale:

- Access control: column-level and row-level security (restrict PII columns to authorized users; restrict rows by region for GDPR). Use Apache Ranger or AWS Lake Formation policies.
- Data lineage: track which upstream tables and jobs produced each table version (Apache Atlas, OpenLineage/Marquez). Required for debugging incorrect reports and for GDPR “right to erasure” (identify all downstream tables that contain a user’s data).
- Data quality: run Great Expectations or Deequ checks on each new batch: null rate, value range, referential integrity. Fail the pipeline and alert on quality violations rather than silently ingesting bad data (a minimal check is sketched after this list).
- Catalog: a data catalog (DataHub, Amundsen) makes tables discoverable with business descriptions, ownership, and usage statistics, which is critical for large organizations where data consumers cannot know all available tables.
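A minimal hand-rolled quality gate in PySpark, standing in for a Great Expectations or Deequ suite; the thresholds, table path, and column names are assumptions. The point is to raise on violations so the orchestrator stops the pipeline and alerts rather than silently ingesting bad data.

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col

    spark = SparkSession.builder.appName("quality-check").getOrCreate()
    batch = spark.read.format("delta").load("s3://datalake/events/")  # assumed path

    total = batch.count()
    null_user_ids = batch.filter(col("user_id").isNull()).count()
    bad_types = batch.filter(~col("event_type").isin("click", "view", "purchase")).count()

    null_rate = null_user_ids / max(total, 1)

    # Fail loudly so the scheduler (e.g. Airflow) marks the run failed and alerts.
    if null_rate > 0.01:
        raise ValueError(f"user_id null rate {null_rate:.2%} exceeds the 1% threshold")
    if bad_types > 0:
        raise ValueError(f"{bad_types} rows have an unexpected event_type")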

See also: Databricks Interview Prep

See also: Meta Interview Prep

See also: Netflix Interview Prep
