System Design: IoT Data Platform — Device Ingestion, Time-Series Storage, and Real-Time Alerting

Requirements

An IoT data platform ingests telemetry from millions of connected devices — sensors, smart meters, industrial equipment, wearables — and provides storage, querying, and real-time alerting. Core challenges: massive write throughput (millions of devices sending readings every few seconds), efficient time-series storage (data is append-only and queried by time range), real-time anomaly detection (alerting within seconds of a threshold breach), and long-term data retention with cost optimization. Scale: AWS IoT Core connects 500 million+ devices. Industrial IoT platforms may ingest 1 million readings per second. Smart city deployments can have 100K+ sensors per city block.

Device Connectivity and Ingestion

Protocol selection: MQTT (originally MQ Telemetry Transport) is the standard for IoT ingestion. It uses a pub/sub model over TCP with as little as 2 bytes of fixed-header overhead, three QoS levels (0 = at-most-once "fire-and-forget", 1 = at-least-once, 2 = exactly-once), and persistent sessions (the broker queues messages for offline devices). Alternatives: CoAP (Constrained Application Protocol) for extremely resource-constrained devices (microcontrollers), and HTTP/REST for devices with sufficient resources, where simplicity and broad compatibility matter more than efficiency.

Architecture: devices connect to an MQTT broker cluster (EMQX, HiveMQ, AWS IoT Core). Each device publishes to its own topic, devices/{device_id}/telemetry, and the broker routes messages to subscribers, in this case the ingestion pipeline (see the publisher sketch below).

Ingestion pipeline: MQTT broker → Kafka (device_telemetry topic) → stream processor (Flink/Spark Streaming) → time-series DB + real-time alert engine. Kafka serves as a buffer and replay log; it handles 1M+ messages/second on a modest cluster.

Message schema: {device_id, timestamp (ISO 8601 or Unix epoch), readings: {temperature: 23.4, humidity: 60, battery: 87}, metadata: {firmware: "v2.1.3"}}.

Device registry: maintain a device registry (PostgreSQL) with device_id, owner_id, type, model, firmware_version, provisioning_date, location, and is_active. It is used to validate incoming messages and enrich them with metadata (see the consumer sketch below).
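A device-side publisher can be sketched as follows; this is a minimal example using the paho-mqtt 1.x client API, with the broker host, port, and device ID as placeholders:

```python
# Device-side publisher sketch (paho-mqtt 1.x API).
# Broker host, port, and device ID are placeholders.
import json
import time

import paho.mqtt.client as mqtt

DEVICE_ID = "meter-0042"                    # hypothetical device
TOPIC = f"devices/{DEVICE_ID}/telemetry"

client = mqtt.Client(client_id=DEVICE_ID)
client.connect("mqtt.example.com", 1883)    # placeholder broker
client.loop_start()                         # background network loop

payload = json.dumps({
    "device_id": DEVICE_ID,
    "timestamp": int(time.time()),          # Unix epoch seconds
    "readings": {"temperature": 23.4, "humidity": 60, "battery": 87},
    "metadata": {"firmware": "v2.1.3"},
})

# QoS 1 = at-least-once: the broker acknowledges, the client retries if needed.
info = client.publish(TOPIC, payload, qos=1)
info.wait_for_publish()
client.loop_stop()
client.disconnect()
```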

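On the ingestion side, the validate-and-enrich step against the device registry might look like the sketch below, assuming a kafka-python consumer and a psycopg2 connection; the topic name, DSN, and column names are illustrative:

```python
# Ingestion-side sketch: consume raw telemetry from Kafka, validate the
# device against the registry, and enrich with metadata. Topic name,
# connection strings, and registry columns are assumptions.
import json

import psycopg2
from kafka import KafkaConsumer

pg = psycopg2.connect("dbname=iot user=ingest")     # placeholder DSN
consumer = KafkaConsumer(
    "device_telemetry",
    bootstrap_servers="kafka.example.com:9092",
    value_deserializer=lambda b: json.loads(b),
)

def lookup_device(device_id):
    """Fetch the registry row; returns None for unknown/inactive devices."""
    with pg.cursor() as cur:
        cur.execute(
            "SELECT owner_id, type, location FROM devices "
            "WHERE device_id = %s AND is_active",
            (device_id,),
        )
        return cur.fetchone()

for msg in consumer:
    reading = msg.value
    device = lookup_device(reading["device_id"])
    if device is None:
        continue  # drop readings from unknown or deactivated devices
    owner_id, dev_type, location = device
    reading["metadata"].update(owner_id=owner_id, type=dev_type, location=location)
    # ... hand off to the time-series writer and the alert engine
```

A production consumer would cache registry rows (the registry changes rarely relative to telemetry volume) rather than querying PostgreSQL per message.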
Time-Series Storage

Time-series data has a distinct access pattern: high write throughput, queries that always include a time range, and recent data accessed far more often than old data. Purpose-built time-series databases outperform general-purpose RDBMSs by 10-100x for this workload.

InfluxDB: purpose-built for metrics and events. Measurements (analogous to tables) are organized by tags (indexed metadata: device_id, location, type) and fields (values: temperature, humidity). Retention policies auto-expire old data (e.g., raw 1-second data kept 30 days; 1-minute aggregates kept 1 year; 1-hour aggregates kept forever). Continuous queries run aggregations in the background, populating downsampled measurements.

TimescaleDB: a PostgreSQL extension. Hypertables auto-partition data by time (e.g., one chunk per week), older chunks compress by 90%+, and standard SQL works transparently across all chunks. Best of both worlds: time-series optimization plus full SQL (see the setup sketch below).

Compression: time-series data compresses extremely well (delta encoding for timestamps, Gorilla encoding for floating-point values). InfluxDB and TimescaleDB achieve 10-50x compression; a raw 100GB dataset shrinks to 2-10GB.

Data tiering: keep recent data (last 7 days) on fast NVMe SSDs; move older data to object storage (S3) in Parquet format for cheap long-term analytics, queried via Athena or BigQuery.
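A setup sketch for the TimescaleDB option, issued from Python via psycopg2 and assuming TimescaleDB 2.x; the table layout and policy intervals mirror the tiering described above:

```python
# TimescaleDB setup sketch: hypertable, compression, and retention.
# Table and column names are illustrative; policies assume TimescaleDB 2.x.
import psycopg2

conn = psycopg2.connect("dbname=iot user=admin")    # placeholder DSN
conn.autocommit = True

with conn.cursor() as cur:
    cur.execute("""
        CREATE TABLE IF NOT EXISTS telemetry (
            time        TIMESTAMPTZ NOT NULL,
            device_id   TEXT        NOT NULL,
            temperature DOUBLE PRECISION,
            humidity    DOUBLE PRECISION,
            battery     DOUBLE PRECISION
        );
    """)
    # Auto-partition into 1-week chunks by time.
    cur.execute("""
        SELECT create_hypertable('telemetry', 'time',
                                 chunk_time_interval => INTERVAL '7 days',
                                 if_not_exists => TRUE);
    """)
    # Compress chunks older than 7 days, segmented by device so that
    # per-device range queries stay local within compressed batches.
    cur.execute("""
        ALTER TABLE telemetry SET (
            timescaledb.compress,
            timescaledb.compress_segmentby = 'device_id'
        );
    """)
    cur.execute("SELECT add_compression_policy('telemetry', INTERVAL '7 days');")
    # Drop raw data after 30 days (downsampled aggregates live elsewhere).
    cur.execute("SELECT add_retention_policy('telemetry', INTERVAL '30 days');")
```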

Real-Time Alerting

Alert rules: device owners configure threshold-based alerts, e.g., temperature > 80°C for 5 consecutive readings → CRITICAL alert. Rules are stored in an AlertRule table: rule_id, device_id (or device_group), metric, operator, threshold, sustained_periods, severity, notification_channels.

Alert evaluation: Flink stateful stream processing evaluates rules against the live data stream. A per-device state machine tracks the count of consecutive threshold breaches; when count >= sustained_periods, the alert fires. Deduplication: suppress repeated alerts for the same condition, firing once when the condition is met and firing a resolution alert when it clears (see the state-machine sketch below).

Alert routing: alerts are published to a Kafka alert topic, consumed by a notification service, and fanned out to PagerDuty, SMS, email, webhooks, and Slack.

Anomaly detection: ML-based detection (isolation forest, LSTM autoencoder) covers devices where the threshold is not known in advance. A baseline is learned from 2-4 weeks of historical data per device, and an anomaly score is published alongside the readings; users can then alert on anomaly_score > 0.8. This runs as a separate Flink job consuming the same Kafka topic (a scoring sketch follows the state machine below).
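The sustained-threshold evaluation can be sketched as a plain-Python state machine (in Flink this counter would live in keyed per-device state); the AlertRule fields mirror the table above, and only the '>' operator is implemented:

```python
# Per-device alert state machine: minimal sketch of the sustained-threshold
# logic. Pure Python stand-in for Flink keyed state; '>' operator only.
from dataclasses import dataclass

@dataclass
class AlertRule:
    metric: str
    threshold: float
    sustained_periods: int
    severity: str

class DeviceAlertState:
    """Tracks consecutive breaches for one (device, rule) pair."""

    def __init__(self, rule: AlertRule):
        self.rule = rule
        self.breach_count = 0
        self.alert_active = False   # dedup: fire once per episode

    def on_reading(self, readings: dict):
        """Returns 'FIRE', 'RESOLVE', or None for each new reading."""
        value = readings.get(self.rule.metric)
        if value is not None and value > self.rule.threshold:
            self.breach_count += 1
            if self.breach_count >= self.rule.sustained_periods and not self.alert_active:
                self.alert_active = True
                return "FIRE"
        else:
            self.breach_count = 0
            if self.alert_active:
                self.alert_active = False
                return "RESOLVE"
        return None

# Example: 80°C threshold sustained for 5 readings.
state = DeviceAlertState(AlertRule("temperature", 80.0, 5, "CRITICAL"))
for t in [81, 82, 85, 90, 91, 92, 70]:
    event = state.on_reading({"temperature": t})
    if event:
        print(event)   # FIRE on the 5th breach, RESOLVE at 70
```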

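For the ML path, a minimal anomaly-scoring sketch using scikit-learn's IsolationForest; the baseline data here is synthetic, and mapping score_samples onto a 0-1 anomaly score is an assumption, not a library convention:

```python
# Anomaly-scoring sketch: train a per-device baseline on historical
# readings, then score new ones. Features and score mapping are assumptions.
import numpy as np
from sklearn.ensemble import IsolationForest

# Hypothetical baseline: ~2 weeks of (temperature, humidity) readings.
rng = np.random.default_rng(0)
baseline = rng.normal(loc=[23.0, 60.0], scale=[1.5, 5.0], size=(20_000, 2))

model = IsolationForest(n_estimators=100, random_state=0).fit(baseline)

def anomaly_score(reading: dict) -> float:
    """Negate score_samples (higher = more normal) so higher = more anomalous."""
    x = [[reading["temperature"], reading["humidity"]]]
    return float(-model.score_samples(x)[0])

print(anomaly_score({"temperature": 23.2, "humidity": 61}))  # lower score
print(anomaly_score({"temperature": 95.0, "humidity": 5}))   # higher score
```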
Query and Visualization

Common query patterns: time-range queries (temperature of device X from 9am to 5pm), aggregation (hourly average temperature across all devices in building Y), downsampling (1-second raw data to 1-minute averages for charting), and latest value (current status of all devices in a fleet).

Query API: REST/GraphQL with time range, device filter, metric, and aggregation parameters. The backend translates requests to InfluxDB Flux queries or TimescaleDB SQL.

Dashboard: Grafana connects natively to InfluxDB and TimescaleDB, with pre-built IoT dashboards and auto-refresh; customers can build custom dashboards. Live streaming: a WebSocket connection pushes new readings to the dashboard every second without polling.

Device shadow: maintain a "shadow" (last known state) for each device in Redis: HSET device:{id}:shadow temperature 23.4 humidity 60 updated_at 1700000000. This provides instant "current state" queries without scanning the time-series DB, and is updated by the stream processor on every new reading (see the sketch below).
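A shadow read/write sketch with redis-py, matching the HSET key layout above; the Redis host is a placeholder:

```python
# Device-shadow sketch: the stream processor writes the last known state
# on every reading; the API reads it in O(1). Host is a placeholder.
import time

import redis

r = redis.Redis(host="redis.example.com", port=6379, decode_responses=True)

def update_shadow(device_id: str, readings: dict):
    """Called by the stream processor for every incoming reading."""
    key = f"device:{device_id}:shadow"
    r.hset(key, mapping={**readings, "updated_at": int(time.time())})

def get_shadow(device_id: str) -> dict:
    """Instant 'current state' lookup, no time-series scan."""
    return r.hgetall(f"device:{device_id}:shadow")

update_shadow("meter-0042", {"temperature": 23.4, "humidity": 60, "battery": 87})
print(get_shadow("meter-0042"))
```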

