Kubernetes is the de facto standard for container orchestration. Understanding its architecture, scheduling model, and operational patterns is increasingly expected in senior engineering interviews at companies that run microservices at scale.
Kubernetes Architecture
Control Plane (formerly "master"):
┌──────────────────────────────────────────────┐
│ API Server — central REST endpoint │
│ etcd — distributed KV store │
│ Scheduler — assigns pods to nodes │
│ Controller Mgr — reconciliation loops │
│ Cloud Controller— cloud provider integration │
└──────────────────────────────────────────────┘
Worker Nodes:
┌──────────────────────────────────────────────┐
│ kubelet — node agent, manages pods │
│ kube-proxy — network rules (iptables/IPVS) │
│ Container Runtime (containerd / CRI-O) │
│ Pods (1..N) │
└──────────────────────────────────────────────┘
etcd: The Source of Truth
- Stores all cluster state: pod specs, service definitions, configmaps, secrets
- Raft consensus — typically 3 or 5 nodes for HA (tolerates ⌊(N−1)/2⌋ failures: 1 of 3, 2 of 5)
- API server is the ONLY component that talks to etcd directly
- Watch mechanism: components subscribe to key prefixes; etcd pushes changes → reactive reconciliation
Pod Lifecycle and Scheduling
Pod scheduling flow:
1. User creates Pod spec → API Server stores in etcd (Pending)
2. Scheduler watches for unscheduled pods
3. Filtering: eliminate nodes that don't satisfy constraints
- Resource requests: node has enough CPU/memory
- Node selectors / affinity rules
- Taints and tolerations
- Pod topology spread constraints
4. Scoring: rank remaining nodes
- Least allocated (spread evenly)
- Image locality (node already has image)
- Inter-pod affinity scores
5. Bind pod to highest-scoring node → API Server updates etcd
6. kubelet on node watches → pulls image → starts container
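The filtering constraints in step 3 all live in the Pod spec. A minimal sketch (the `disktype` label, taint key, and `app: web` selector are illustrative, not standard names):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: web
  labels:
    app: web
spec:
  containers:
    - name: web
      image: nginx:1.25
      resources:
        requests:            # filtering: node must have this much unallocated
          cpu: "250m"
          memory: "128Mi"
  nodeSelector:              # filtering: only nodes with this label pass
    disktype: ssd
  tolerations:               # filtering: permits nodes with this taint
    - key: "dedicated"
      operator: "Equal"
      value: "web"
      effect: "NoSchedule"
  topologySpreadConstraints: # filtering: spread replicas across zones
    - maxSkew: 1
      topologyKey: topology.kubernetes.io/zone
      whenUnsatisfiable: DoNotSchedule
      labelSelector:
        matchLabels:
          app: web
```

A node must pass every filter to reach the scoring phase; if no node passes, the pod stays Pending.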
Resource Requests vs Limits
resources:
  requests:           # scheduler uses this for placement
    cpu: "500m"       # 0.5 CPU cores
    memory: "256Mi"
  limits:             # hard cap at runtime
    cpu: "1000m"      # throttled if exceeded (not killed)
    memory: "512Mi"   # OOMKilled if exceeded
QoS Classes:
Guaranteed: requests == limits for every container → evicted last, only under severe node pressure
Burstable: requests set but below limits (or only some set) → evicted if the node is pressured and usage exceeds requests
BestEffort: no requests/limits → evicted first
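The QoS class is inferred from the resources block alone; a sketch of what each class looks like in a container spec (values are illustrative):

```yaml
# Guaranteed: every container sets requests == limits
resources:
  requests: { cpu: "500m", memory: "256Mi" }
  limits:   { cpu: "500m", memory: "256Mi" }

# Burstable: requests set, limits higher (or absent)
resources:
  requests: { cpu: "250m", memory: "128Mi" }
  limits:   { cpu: "1000m", memory: "512Mi" }

# BestEffort: the resources block is omitted entirely
```

Under memory pressure the kubelet evicts in reverse QoS order, so production services should at minimum set requests to avoid BestEffort.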
Vertical Pod Autoscaler (VPA): automatically adjusts requests
Horizontal Pod Autoscaler (HPA): adjusts replica count
Deployments, ReplicaSets, and Rolling Updates
Deployment → manages → ReplicaSet → manages → Pods
Rolling update strategy (in the Deployment spec):
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 25%   # how many pods can be down during the update
      maxSurge: 25%         # how many extra pods can be created above desired
Update flow:
1. New ReplicaSet created with new pod template
2. Scale up new RS by maxSurge pods
3. Scale down old RS by maxUnavailable pods
4. Repeat until new RS = desired, old RS = 0
Rollback:
kubectl rollout undo deployment/my-app
(keeps old ReplicaSet for instant rollback)
Blue-Green via labels:
Service selector: version=blue → route to v1 pods
Deploy v2 pods, test, switch selector: version=green
Zero-downtime cutover
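The cutover is a one-field change on the Service. A sketch, assuming a Service named my-app and pods labeled with a version label (both names illustrative):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: my-app
spec:
  selector:
    app: my-app
    version: blue      # change to "green" to shift all traffic to v2 pods
  ports:
    - port: 80
      targetPort: 8080
```

Because kube-proxy reprograms its rules from the updated endpoints, the switch is effectively atomic from the client's perspective; rollback is the same edit in reverse.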
Kubernetes Networking
Network model rules:
- Every pod gets a unique cluster-wide IP
- Pods can communicate with any other pod without NAT
- Nodes can communicate with pods without NAT
Implementation (CNI plugins):
Calico: eBPF or iptables, supports NetworkPolicy, BGP peering
Flannel: simple VXLAN overlay, no NetworkPolicy
Cilium: eBPF-based, L7 policy, Hubble observability
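NetworkPolicy (enforced by Calico/Cilium, ignored by Flannel) is how pod-to-pod traffic is restricted. A hedged sketch allowing only frontend pods to reach backend pods (label names are illustrative):

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-frontend-to-backend
  namespace: default
spec:
  podSelector:              # policy applies to these pods
    matchLabels:
      app: backend
  policyTypes: ["Ingress"]
  ingress:
    - from:
        - podSelector:      # only pods with this label may connect
            matchLabels:
              app: frontend
      ports:
        - protocol: TCP
          port: 8080
```

Once any ingress policy selects a pod, all ingress not explicitly allowed is denied (default-deny semantics per selected pod).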
Services (stable VIPs for pods):
ClusterIP: internal VIP, kube-proxy creates iptables rules
NodePort: expose on every node's IP:port (30000-32767)
LoadBalancer: cloud provider creates external LB, maps to NodePort
Headless: no VIP, DNS returns individual pod IPs (for StatefulSets)
DNS within cluster:
my-service.my-namespace.svc.cluster.local → ClusterIP
172-17-0-3.my-namespace.pod.cluster.local → pod IP (dots in the IP replaced with dashes)
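The headless variant is just a ClusterIP Service with clusterIP set to None; a minimal sketch (Service name and port are illustrative):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: my-service
spec:
  clusterIP: None      # headless: DNS returns individual pod IPs, no VIP
  selector:
    app: my-app
  ports:
    - port: 5432
```

A DNS lookup of my-service.my-namespace.svc.cluster.local then returns one A record per ready pod, letting clients (or StatefulSet peers) address pods directly.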
StatefulSets for Stateful Workloads
StatefulSet guarantees (vs Deployment):
- Stable, unique pod names: mysql-0, mysql-1, mysql-2
- Ordered, sequential pod creation (0 → 1 → 2)
- Stable network identity: mysql-0.mysql.default.svc.cluster.local
- Persistent volume per pod (PVC not shared, not deleted on pod delete)
Use cases: databases (MySQL, Cassandra, Kafka, ZooKeeper)
Example: Kafka StatefulSet
kafka-0 → PVC: kafka-data-0 (broker 0)
kafka-1 → PVC: kafka-data-1 (broker 1)
kafka-2 → PVC: kafka-data-2 (broker 2)
Headless service → DNS for each broker separately
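A sketch of the Kafka example as a StatefulSet (image tag and storage size are illustrative; note the PVCs are actually named claim-podname, e.g. kafka-data-kafka-0):

```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: kafka
spec:
  serviceName: kafka          # headless Service providing per-broker DNS
  replicas: 3
  selector:
    matchLabels: { app: kafka }
  template:
    metadata:
      labels: { app: kafka }
    spec:
      containers:
        - name: kafka
          image: apache/kafka:3.7.0   # illustrative image/tag
          volumeMounts:
            - name: kafka-data
              mountPath: /var/lib/kafka
  volumeClaimTemplates:       # one PVC stamped out per ordinal
    - metadata:
        name: kafka-data
      spec:
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 100Gi
```

If kafka-1 is rescheduled to another node, it keeps its name, its DNS record, and reattaches the same PVC, so the broker's identity and data survive the move.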
Horizontal Pod Autoscaler (HPA)
HPA control loop (every 15s):
desiredReplicas = ceil(currentReplicas × (currentMetric / targetMetric))
Example:
Current: 4 replicas, CPU at 80%
Target: CPU 50%
desired = ceil(4 × 80/50) = ceil(6.4) = 7 replicas → scale up to 7
Metric sources:
Built-in: CPU utilization, memory utilization
Custom: requests/sec, queue depth (via Prometheus + adapter)
External: SQS queue depth, Pub/Sub undelivered messages (KEDA)
Scale-down stabilization (default 300s):
Prevents thrashing — the controller applies the highest desired replica count computed over the window, so it only scales down once the lower count has held for 5 consecutive minutes
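The example above maps directly onto an autoscaling/v2 manifest; a sketch (Deployment name and replica bounds are illustrative):

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-app
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 50      # the 50% target from the example
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300 # the 5-minute default, made explicit
```

The behavior block also accepts per-direction policies (e.g. "at most 2 pods per minute"), which is how scaling velocity is bounded in practice.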
Kubernetes Observability
The three pillars:
Metrics: Prometheus scrapes /metrics endpoints
→ Grafana dashboards
→ AlertManager → PagerDuty
Logs: stdout/stderr → node log agent (Fluentd / Fluent Bit)
→ Elasticsearch or Cloud Logging
→ Structured JSON logs with pod name, namespace, trace_id
Traces: OpenTelemetry SDK in app
→ Collector sidecar or daemonset
→ Jaeger / Tempo / AWS X-Ray
Key metrics to monitor:
Pod: CPU throttling rate, OOMKill count, restart count
Node: allocatable vs requested CPU/memory, eviction rate
Cluster: pending pods (scheduling backlog), API server latency
Common Interview Design Questions
How does Kubernetes handle node failure?
Node Controller detects missing heartbeats. After node-monitor-grace-period (default 40s), the node is marked NotReady and tainted (node.kubernetes.io/not-ready or unreachable). With taint-based eviction (the default since v1.18, replacing the old pod-eviction-timeout behavior), every pod gets an automatic toleration for those taints with tolerationSeconds: 300, so pods are evicted and rescheduled to healthy nodes after ~5 minutes; setting a shorter tolerationSeconds per pod makes eviction faster.
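For latency-sensitive workloads, the per-pod eviction delay can be shortened by overriding the default tolerations; a sketch (the 60s value is illustrative):

```yaml
tolerations:
  - key: node.kubernetes.io/not-ready
    operator: Exists
    effect: NoExecute
    tolerationSeconds: 60    # evict after 60s instead of the 300s default
  - key: node.kubernetes.io/unreachable
    operator: Exists
    effect: NoExecute
    tolerationSeconds: 60
```

The trade-off: a shorter window reschedules faster after real failures but also churns pods during transient network blips.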
How do you run a database in Kubernetes?
Use StatefulSet + PersistentVolumeClaim (StorageClass: gp3/pd-ssd). For production, use an operator (CloudNativePG, Vitess, CockroachDB operator) that handles replication, failover, and backups. Single-node databases in k8s are fine; multi-node requires operator for coordination. Alternatively, use managed cloud databases (RDS, Cloud SQL) outside k8s for simpler ops.
Kubernetes vs serverless
| Factor | Kubernetes | Serverless (Lambda) |
|---|---|---|
| Cold start | Pod startup ~5-30s | ms to seconds |
| Max duration | Unlimited | 15 min (Lambda) |
| Scaling | HPA (minutes) | Per-request (instant) |
| Cost model | Reserved capacity | Per-invocation |
| Debugging | Full shell access | Limited (logs only) |
| Best for | Long-running services | Event-driven, bursty |