System Design Interview: Kubernetes and Container Orchestration

Kubernetes is the de facto standard for container orchestration. Understanding its architecture, scheduling model, and operational patterns is increasingly expected in senior engineering interviews at companies that run microservices at scale.

Kubernetes Architecture

Control Plane (Master):
  ┌───────────────────────────────────────────────┐
  │ API Server       — central REST endpoint      │
  │ etcd             — distributed KV store       │
  │ Scheduler        — assigns pods to nodes      │
  │ Controller Mgr   — reconciliation loops       │
  │ Cloud Controller — cloud provider integration │
  └───────────────────────────────────────────────┘

Worker Nodes:
  ┌───────────────────────────────────────────────┐
  │ kubelet          — node agent, manages pods   │
  │ kube-proxy       — network rules (iptables)   │
  │ Container Runtime (containerd / CRI-O)        │
  │ Pods (1..N)                                   │
  └───────────────────────────────────────────────┘

etcd: The Source of Truth

  • Stores all cluster state: pod specs, service definitions, configmaps, secrets
  • Raft consensus — run 3 or 5 members for HA; a majority quorum must stay up, so N members tolerate ⌊(N−1)/2⌋ failures (1 of 3, 2 of 5)
  • API server is the ONLY component that talks to etcd directly
  • Watch mechanism: components subscribe to key prefixes; etcd pushes changes → reactive reconciliation

Pod Lifecycle and Scheduling

Pod scheduling flow:
  1. User creates Pod spec → API Server stores in etcd (Pending)
  2. Scheduler watches for unscheduled pods
  3. Filtering: eliminate nodes that don't satisfy constraints
     - Resource requests: node has enough CPU/memory
     - Node selectors / affinity rules
     - Taints and tolerations
     - Pod topology spread constraints
  4. Scoring: rank remaining nodes
     - Least allocated (spread evenly)
     - Image locality (node already has image)
     - Inter-pod affinity scores
  5. Bind pod to highest-scoring node → API Server updates etcd
  6. kubelet on node watches → pulls image → starts container
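
The filtering constraints in step 3 all live in the pod spec. A sketch of one pod exercising several of them (the image, labels, and taint values are placeholders, not from the original):

```yaml
# Hypothetical pod spec showing the fields the scheduler's filter step reads.
apiVersion: v1
kind: Pod
metadata:
  name: api-server
  labels:
    app: api
spec:
  containers:
    - name: app
      image: registry.example.com/api:1.4.2   # placeholder image
      resources:
        requests:                  # filter: node must have this much unreserved
          cpu: "500m"
          memory: "256Mi"
  nodeSelector:
    disktype: ssd                  # filter: only nodes labeled disktype=ssd pass
  tolerations:
    - key: "dedicated"             # filter: allows scheduling onto tainted nodes
      operator: "Equal"
      value: "team-a"
      effect: "NoSchedule"
  topologySpreadConstraints:
    - maxSkew: 1                   # filter: keep replicas spread across zones
      topologyKey: topology.kubernetes.io/zone
      whenUnsatisfiable: DoNotSchedule
      labelSelector:
        matchLabels:
          app: api
```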

Resource Requests vs Limits

resources:
  requests:          # scheduler uses this for placement
    cpu: "500m"      # 0.5 CPU cores
    memory: "256Mi"
  limits:            # hard cap at runtime
    cpu: "1000m"     # throttled if exceeded (not killed)
    memory: "512Mi"  # OOMKilled if exceeded

QoS Classes:
  Guaranteed: requests == limits for every container → evicted last under pressure
  Burstable:  requests set, but below limits → evicted after BestEffort
  BestEffort: no requests/limits → evicted first
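
QoS class is derived, not declared. A minimal sketch of a pod that lands in the Guaranteed class (name and image are placeholders):

```yaml
# Every container's requests equal its limits → kubelet assigns QoS Guaranteed.
apiVersion: v1
kind: Pod
metadata:
  name: critical-worker
spec:
  containers:
    - name: worker
      image: registry.example.com/worker:2.0   # placeholder image
      resources:
        requests:
          cpu: "1"
          memory: "512Mi"
        limits:
          cpu: "1"          # == request
          memory: "512Mi"   # == request
```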

Vertical Pod Autoscaler (VPA): automatically adjusts requests
Horizontal Pod Autoscaler (HPA): adjusts replica count

Deployments, ReplicaSets, and Rolling Updates

Deployment → manages → ReplicaSet → manages → Pods

Rolling update strategy:
  maxUnavailable: 25%  # how many pods can be down during update
  maxSurge: 25%        # how many extra pods can be created

Update flow:
  1. New ReplicaSet created with new pod template
  2. Scale up new RS by maxSurge pods
  3. Scale down old RS by maxUnavailable pods
  4. Repeat until new RS = desired, old RS = 0
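
The strategy knobs above, placed in a complete (hypothetical) Deployment manifest:

```yaml
# Sketch: with 8 replicas, 25% maxSurge/maxUnavailable means the rollout
# runs between 6 and 10 pods at any moment.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
spec:
  replicas: 8
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 25%          # up to 2 extra pods during the rollout
      maxUnavailable: 25%    # at most 2 pods below the desired count
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
        - name: app
          image: registry.example.com/my-app:v2   # changing this creates a new ReplicaSet
```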

Rollback:
  kubectl rollout undo deployment/my-app
  (keeps old ReplicaSet for instant rollback)

Blue-Green via labels:
  Service selector: version=blue → route to v1 pods
  Deploy v2 pods, test, switch selector: version=green
  Zero-downtime cutover
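
The Service selector is the switch in that cutover. A sketch (service name and ports are assumptions):

```yaml
# Editing `version: blue` to `version: green` instantly reroutes all traffic
# to the v2 pods — no pod restarts involved.
apiVersion: v1
kind: Service
metadata:
  name: my-app
spec:
  selector:
    app: my-app
    version: blue     # flip to green for the cutover
  ports:
    - port: 80
      targetPort: 8080
```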

Kubernetes Networking

Network model rules:
  - Every pod gets a unique cluster-wide IP
  - Pods can communicate with any other pod without NAT
  - Nodes can communicate with pods without NAT

Implementation (CNI plugins):
  Calico: eBPF or iptables, supports NetworkPolicy, BGP peering
  Flannel: simple VXLAN overlay, no NetworkPolicy
  Cilium:  eBPF-based, L7 policy, Hubble observability
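
The NetworkPolicy mentioned above is a namespaced resource enforced by the CNI plugin. A sketch (app labels are placeholders): only pods labeled app=frontend may reach app=backend on port 8080; all other ingress to backend is dropped.

```yaml
# Enforced by Calico/Cilium; silently ignored on plain Flannel.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: backend-allow-frontend
spec:
  podSelector:           # the pods this policy protects
    matchLabels:
      app: backend
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector:   # the only permitted callers
            matchLabels:
              app: frontend
      ports:
        - protocol: TCP
          port: 8080
```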

Services (stable VIPs for pods):
  ClusterIP:    internal VIP, kube-proxy creates iptables rules
  NodePort:     expose on every node's IP:port (30000-32767)
  LoadBalancer: cloud provider creates external LB, maps to NodePort
  Headless:     no VIP, DNS returns individual pod IPs (for StatefulSets)
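
All four types share one manifest shape; `type` selects the exposure, and `clusterIP: None` turns an ordinary Service into a headless one. A sketch (name and ports are assumptions):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: my-service
spec:
  type: ClusterIP      # or NodePort / LoadBalancer
  # clusterIP: None    # uncomment for a headless Service (per-pod DNS, no VIP)
  selector:
    app: my-app
  ports:
    - port: 80         # VIP port
      targetPort: 8080 # container port
```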

DNS within cluster:
  my-service.my-namespace.svc.cluster.local → ClusterIP
  10-244-1-5.my-namespace.pod.cluster.local → pod IP (dots in the IP become dashes)

StatefulSets for Stateful Workloads

StatefulSet guarantees (vs Deployment):
  - Stable, unique pod names: mysql-0, mysql-1, mysql-2
  - Ordered, sequential pod creation (0 → 1 → 2)
  - Stable network identity: mysql-0.mysql.default.svc.cluster.local
  - Persistent volume per pod (PVC not shared, not deleted on pod delete)

Use cases: databases (MySQL, Cassandra, Kafka, ZooKeeper)

Example: Kafka StatefulSet
  kafka-0 → PVC: kafka-data-kafka-0 (broker 0)
  kafka-1 → PVC: kafka-data-kafka-1 (broker 1)
  kafka-2 → PVC: kafka-data-kafka-2 (broker 2)
  Headless service → DNS for each broker separately
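
A condensed sketch of that setup (image and storage size are placeholders). `volumeClaimTemplates` stamps out one PVC per pod, named `<template>-<pod>`, and `serviceName` ties pod DNS to the headless Service:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: kafka
spec:
  clusterIP: None            # headless: kafka-0.kafka, kafka-1.kafka, ...
  selector:
    app: kafka
  ports:
    - port: 9092
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: kafka
spec:
  serviceName: kafka         # governs the pods' stable DNS names
  replicas: 3
  selector:
    matchLabels:
      app: kafka
  template:
    metadata:
      labels:
        app: kafka
    spec:
      containers:
        - name: kafka
          image: registry.example.com/kafka:3.7   # placeholder image
          volumeMounts:
            - name: kafka-data
              mountPath: /var/lib/kafka
  volumeClaimTemplates:      # one PVC per pod, survives pod deletion
    - metadata:
        name: kafka-data
      spec:
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 100Gi
```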

Horizontal Pod Autoscaler (HPA)

HPA control loop (every 15s):
  desiredReplicas = ceil(currentReplicas × (currentMetric / targetMetric))

Example:
  Current: 4 replicas, CPU at 80%
  Target: CPU 50%
  desired = ceil(4 × 80/50) = ceil(6.4) = 7 replicas → scale up to 7

Metric sources:
  Built-in: CPU utilization, memory utilization
  Custom:   requests/sec, queue depth (via Prometheus + adapter)
  External: SQS queue depth, Pub/Sub undelivered messages (KEDA)

Scale-down stabilization (default 5 min):
  Prevents thrashing — the HPA scales down only to the highest replica count it recommended at any point during the window
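
The CPU example above as an autoscaling/v2 manifest (target name and replica bounds are assumptions):

```yaml
# 50% CPU utilization target with the default 5-minute
# scale-down stabilization window made explicit.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-app
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  minReplicas: 2
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 50
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300   # default; raise for burstier workloads
```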

Kubernetes Observability

The three pillars:

Metrics: Prometheus scrapes /metrics endpoints
  → Grafana dashboards
  → AlertManager → PagerDuty

Logs: stdout/stderr → node log agent (Fluentd/Fluentbit)
  → Elasticsearch or Cloud Logging
  → Structured JSON logs with pod name, namespace, trace_id

Traces: OpenTelemetry SDK in app
  → Collector sidecar or daemonset
  → Jaeger / Tempo / AWS X-Ray

Key metrics to monitor:
  Pod: CPU throttling rate, OOMKill count, restart count
  Node: allocatable vs requested CPU/memory, eviction rate
  Cluster: pending pods (scheduling backlog), API server latency

Common Interview Design Questions

How does Kubernetes handle node failure?

The Node Controller watches kubelet heartbeats (node leases). After node-monitor-grace-period (default 40s) without a heartbeat, the node is marked NotReady and tainted node.kubernetes.io/unreachable with effect NoExecute. Pods carry an automatically injected toleration for that taint with tolerationSeconds=300, so after ~5 minutes they are marked for deletion and rescheduled onto healthy nodes; lowering tolerationSeconds per pod shortens failover to well under a minute. (The legacy --pod-eviction-timeout flag, default 5 min, served this role before taint-based eviction became the default.)
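
Eviction speed after a node failure can be tuned per pod via tolerations. A sketch (pod name and image are placeholders):

```yaml
# Override the default 300s toleration so this pod is rescheduled
# ~30s after its node goes unreachable or NotReady.
apiVersion: v1
kind: Pod
metadata:
  name: fast-failover
spec:
  tolerations:
    - key: node.kubernetes.io/unreachable
      operator: Exists
      effect: NoExecute
      tolerationSeconds: 30
    - key: node.kubernetes.io/not-ready
      operator: Exists
      effect: NoExecute
      tolerationSeconds: 30
  containers:
    - name: app
      image: registry.example.com/app:1.0   # placeholder image
```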

How do you run a database in Kubernetes?

Use StatefulSet + PersistentVolumeClaim (StorageClass: gp3/pd-ssd). For production, use an operator (CloudNativePG, Vitess, CockroachDB operator) that handles replication, failover, and backups. Single-node databases in k8s are fine; multi-node requires operator for coordination. Alternatively, use managed cloud databases (RDS, Cloud SQL) outside k8s for simpler ops.

Kubernetes vs serverless

  Factor         Kubernetes              Serverless (Lambda)
  Cold start     Pod startup ~5-30s      ms to seconds
  Max duration   Unlimited               15 min (Lambda)
  Scaling        HPA (minutes)           Per-request (instant)
  Cost model     Reserved capacity       Per-invocation
  Debugging      Full shell access       Limited (logs only)
  Best for       Long-running services   Event-driven, bursty
