System Design: Data Lake — Ingestion, Storage Layers, and Query Engine (2025)

Data Lake vs. Data Warehouse

A data warehouse stores structured, schema-on-write data optimized for SQL analytics (Redshift, BigQuery, Snowflake). A data lake stores raw data in any format (CSV, JSON, Parquet, images, logs) in cheap object storage (S3, GCS), with schema applied at query time (schema-on-read). The lakehouse architecture combines both: raw data in object storage with a transactional table format (Delta Lake, Apache Iceberg, Apache Hudi) that adds ACID transactions, schema enforcement, and efficient query performance on top of the data lake. Modern stacks often use a lakehouse for both analytical SQL queries and ML feature pipelines.
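To make schema-on-read concrete, here is a minimal PySpark sketch that reads raw JSON straight from object storage and lets Spark infer the schema at query time. The bucket path and the event_type field are illustrative assumptions, not part of the architecture described above.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("schema-on-read-demo").getOrCreate()

    # Raw JSON files landed as-is in object storage; no schema was enforced at
    # write time, so Spark infers one when the files are read (schema-on-read).
    raw_events = spark.read.json("s3://datalake/raw/events/")  # hypothetical path

    # Query the inferred structure with plain SQL.
    raw_events.createOrReplaceTempView("raw_events")
    spark.sql("""
        SELECT event_type, COUNT(*) AS cnt
        FROM raw_events
        GROUP BY event_type
    """).show()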

Ingestion Layer

Three ingestion patterns:

- Batch ingestion: nightly or hourly ETL jobs pull data from operational databases (via CDC with Debezium, or full extracts), transform it, and write to the lake. Tools: Apache Spark, AWS Glue, dbt. Partition by date: s3://datalake/events/year=2026/month=04/day=17/.
- Streaming ingestion: events flow through Kafka → Flink/Spark Streaming → the lake in micro-batches (every 1-5 minutes), enabling near-real-time analytics (see the sketch after this list). Run compaction jobs to merge the many small files this produces (the “small files problem”) into larger Parquet files for efficient reads.
- CDC (Change Data Capture): Debezium reads the database’s binary log (MySQL binlog, PostgreSQL WAL) and publishes every INSERT/UPDATE/DELETE as an event to Kafka. The lake consumer applies these changes to a lakehouse table, maintaining an up-to-date replica of the operational database for analytics without impacting production.
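The streaming path can be sketched with Spark Structured Streaming writing micro-batches from Kafka into a Delta table. This is a hedged example: the broker address, topic name, event schema, and bucket paths are assumptions, and the Delta sink requires the delta-spark package on the cluster.

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, from_json, to_date
    from pyspark.sql.types import StructType, StructField, StringType, TimestampType

    spark = SparkSession.builder.appName("events-ingest").getOrCreate()

    # Assumed shape of the JSON events published to Kafka.
    event_schema = StructType([
        StructField("user_id", StringType()),
        StructField("event_type", StringType()),
        StructField("event_ts", TimestampType()),
    ])

    kafka_stream = (spark.readStream
        .format("kafka")
        .option("kafka.bootstrap.servers", "kafka:9092")   # hypothetical broker
        .option("subscribe", "events")                      # hypothetical topic
        .load())

    events = (kafka_stream
        .select(from_json(col("value").cast("string"), event_schema).alias("e"))
        .select("e.*")
        .withColumn("event_date", to_date(col("event_ts"))))

    # Micro-batch every 5 minutes, partitioned by date to match the lake layout.
    # A periodic compaction job (e.g. Delta OPTIMIZE) later merges the resulting
    # small files into larger Parquet files.
    query = (events.writeStream
        .format("delta")
        .option("checkpointLocation", "s3://datalake/_checkpoints/events/")
        .partitionBy("event_date")
        .trigger(processingTime="5 minutes")
        .start("s3://datalake/events/"))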

Storage Format: Parquet and Delta Lake

Parquet is the standard columnar storage format for analytics: column-oriented (reads only the columns needed), compressed per column (SNAPPY or ZSTD), with column statistics (min, max, null count) in the metadata for predicate pushdown. A query like WHERE event_date = '2026-04-17' skips entire row groups whose event_date min/max statistics exclude that value (e.g., max < '2026-04-17'). Delta Lake adds on top:

- Transaction log: every write appends a JSON entry to the _delta_log/ directory describing the operation (add/remove files). Readers replay the log to construct the current table state.
- ACID transactions: concurrent writes use optimistic concurrency; two writes to the same table succeed if they modify different partitions and conflict otherwise.
- Time travel: query any previous version of the table: SELECT * FROM events VERSION AS OF 100 or TIMESTAMP AS OF '2026-04-01'.
- Schema evolution: add columns without rewriting existing files; old files return NULL for the new column.
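The sketch below exercises these features with PySpark on a Delta table: two writes create two versions in the transaction log, mergeSchema adds a column without rewriting old files, and time travel reads the earlier version. The path and column names are assumptions, and delta-spark must be installed.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("delta-demo").getOrCreate()
    path = "s3://datalake/events_demo/"   # hypothetical table path

    # Version 0: initial write creates the table and the _delta_log/ directory.
    spark.createDataFrame(
        [("u1", "click")], ["user_id", "event_type"]
    ).write.format("delta").mode("overwrite").save(path)

    # Version 1: append a batch with a new column; schema evolution means the
    # old files are not rewritten and simply return NULL for "platform".
    spark.createDataFrame(
        [("u2", "view", "ios")], ["user_id", "event_type", "platform"]
    ).write.format("delta").mode("append").option("mergeSchema", "true").save(path)

    # Time travel: read the table as of version 0 of the transaction log.
    spark.read.format("delta").option("versionAsOf", 0).load(path).show()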

Query Engine and Metastore

The metastore (Hive Metastore, AWS Glue Catalog) stores table schemas, partition locations, and statistics. Query engines (Trino, Spark SQL, Athena, BigQuery Omni) read the metastore to understand the table structure, then scan only the relevant Parquet files using partition pruning and predicate pushdown. Query optimization techniques:

- Partition pruning: partition the table by date; queries filtering on date skip all other partitions.
- Data skipping: Delta Lake stores min/max statistics per file; queries skip files that cannot contain matching rows.
- Z-ordering: co-locate related data (e.g., Z-order by user_id) so that queries filtering on user_id read fewer files.
- Caching: Alluxio or local SSD caches keep hot data near the compute nodes, reducing S3 reads for repeated queries.
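As a sketch of how these optimizations appear to a Spark SQL user on a Delta table: the date filter below touches only one partition, and OPTIMIZE with ZORDER BY rewrites files so per-file min/max statistics prune more files for user_id filters. The table path and columns are assumptions; the OPTIMIZE and ZORDER BY syntax is available in recent Delta Lake releases.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("query-demo").getOrCreate()

    # Partition pruning: only the partition for 2026-04-17 is scanned, because
    # the table is partitioned on event_date.
    spark.sql("""
        SELECT event_type, COUNT(*) AS cnt
        FROM delta.`s3://datalake/events/`
        WHERE event_date = '2026-04-17'
        GROUP BY event_type
    """).show()

    # Z-ordering: co-locate rows with similar user_id values so that data
    # skipping can eliminate most files for a user_id filter.
    spark.sql("OPTIMIZE delta.`s3://datalake/events/` ZORDER BY (user_id)")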

Data Governance and Lineage

Governance requirements at scale:

- Access control: column-level and row-level security (restrict PII columns to authorized users; restrict rows by region for GDPR). Use Apache Ranger or AWS Lake Formation policies.
- Data lineage: track which upstream tables and jobs produced each table version (Apache Atlas, OpenLineage/Marquez). Required for debugging incorrect reports and for GDPR “right to erasure” (identify all downstream tables that contain a user’s data).
- Data quality: run Great Expectations or Deequ checks on each new batch: null rate, value range, referential integrity. Fail the pipeline and alert on quality violations rather than silently ingesting bad data (a minimal check is sketched after this list).
- Catalog: a data catalog (DataHub, Amundsen) makes tables discoverable with business descriptions, ownership, and usage statistics, which is critical for large organizations where data consumers cannot know all available tables.
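A minimal hand-rolled quality gate in PySpark, standing in for a Great Expectations or Deequ suite; the thresholds, table path, and column names are assumptions. The point is to raise on violations so the orchestrator stops the pipeline and alerts rather than silently ingesting bad data.

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col

    spark = SparkSession.builder.appName("quality-check").getOrCreate()
    batch = spark.read.format("delta").load("s3://datalake/events/")  # assumed path

    total = batch.count()
    null_user_ids = batch.filter(col("user_id").isNull()).count()
    bad_types = batch.filter(~col("event_type").isin("click", "view", "purchase")).count()

    null_rate = null_user_ids / max(total, 1)

    # Fail loudly so the scheduler (e.g. Airflow) marks the run failed and alerts.
    if null_rate > 0.01:
        raise ValueError(f"user_id null rate {null_rate:.2%} exceeds the 1% threshold")
    if bad_types > 0:
        raise ValueError(f"{bad_types} rows have an unexpected event_type")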

See also: Databricks Interview Prep

See also: Meta Interview Prep

See also: Netflix Interview Prep
