System Design Interview: Design a CI/CD Deployment Pipeline

Designing a CI/CD (Continuous Integration / Continuous Deployment) system is a common interview question at infrastructure-focused companies like Cloudflare, Stripe, and Atlassian. The challenge is to build a reliable, fast pipeline that deploys code to thousands of servers safely and can roll back when a release goes wrong.

Requirements

Functional: trigger build on code push, run tests, build artifacts, deploy to staging → production, support rollback, provide deployment status and logs.

Non-functional: fast builds (target < 10 min), reliable (no partial deploys), safe (gradual rollout, auto-rollback on errors), auditable (who deployed what, when).

Pipeline Stages

Code Push (git push)
       │
       ▼ webhook
┌──────────────┐
│  CI Server   │ (GitHub Actions, Jenkins, Buildkite)
│  - clone     │
│  - install   │
│  - lint/test │
│  - build     │
└──────┬───────┘
       │ artifact (Docker image, .tar.gz)
       ▼
┌──────────────────┐
│  Artifact Store  │ (S3, ECR, Artifactory)
└──────┬───────────┘
       │
       ▼
┌──────────────────┐
│  Staging Deploy  │ (smoke tests)
└──────┬───────────┘
       │ approval gate (manual or auto)
       ▼
┌──────────────────────────────────┐
│  Production Deploy               │
│  - Blue/Green OR Canary rollout  │
│  - Health checks post-deploy     │
│  - Auto-rollback on error spike  │
└──────────────────────────────────┘
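
The stage ordering in the diagram can be sketched as a fail-fast runner. This is only an illustration under simple assumptions, not how any particular CI server is implemented: `run_pipeline` and the stage callables are hypothetical names, and each stage is assumed to return True on success.

```python
from typing import Callable, List, Tuple

# A stage is a (name, callable) pair; the callable returns True on success.
Stage = Tuple[str, Callable[[], bool]]


def run_pipeline(stages: List[Stage]) -> List[Tuple[str, bool]]:
    """Run stages in order and stop at the first failure.

    Returns (stage name, succeeded) for every stage that actually ran.
    """
    results = []
    for name, step in stages:
        try:
            ok = bool(step())
        except Exception:
            ok = False  # a crashing stage counts as a failed stage
        results.append((name, ok))
        if not ok:
            break  # later stages (e.g. deploy) never run after a failure
    return results
```

Stopping at the first failed stage is what keeps a broken build from ever reaching the artifact store or a deploy step.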

Deployment Strategies

Blue/Green Deployment

Maintain two identical environments (blue = current, green = new). Deploy to green, run smoke tests, switch load balancer traffic to green. Rollback = switch back to blue instantly. Requires 2× infrastructure cost temporarily.
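
The switch-and-rollback behaviour described above might look roughly like the following. `BlueGreenRouter` is a hypothetical in-memory stand-in for a load balancer, and `smoke_test` is an assumed callable that receives the environment name and returns True when the environment is healthy.

```python
class BlueGreenRouter:
    """Tracks which of two environments receives live traffic."""

    def __init__(self, live="blue"):
        self.live = live
        self.versions = {"blue": None, "green": None}

    @property
    def idle(self):
        # The environment that is NOT serving traffic right now.
        return "green" if self.live == "blue" else "blue"

    def deploy(self, version, smoke_test):
        """Deploy to the idle environment; switch traffic only if it passes."""
        target = self.idle
        self.versions[target] = version
        if not smoke_test(target):
            return False  # live traffic was never touched
        self.live = target
        return True

    def rollback(self):
        """Point traffic back at the previous environment."""
        self.live = self.idle
```

Because the old environment keeps running its version untouched, rollback is a single pointer flip rather than a redeploy.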

Canary Deployment

Route 1-5% of traffic to the new version. Monitor error rates and latency. Gradually increase to 10%, 25%, 50%, then 100%. Auto-rollback if the error rate exceeds a threshold. Companies such as Facebook and Google use canary releases for gradual rollouts.

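# Canary rollout progression
# weight: percent of traffic sent to the new version
# wait_minutes: observation window before moving to the next stage
# error_threshold: error rate above which the rollout is rolled back
stages = [
    {"weight": 1,   "wait_minutes": 5,  "error_threshold": 0.01},
    {"weight": 5,   "wait_minutes": 10, "error_threshold": 0.01},
    {"weight": 25,  "wait_minutes": 30, "error_threshold": 0.005},
    {"weight": 100, "wait_minutes": 0,  "error_threshold": 0},
]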
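
One way to drive a stage table like this is a loop that shifts traffic, waits, and checks the error rate. The sketch below makes several assumptions that are not in the original: `set_weight` and `get_error_rate` are hypothetical hooks into the load balancer and metrics system, error rates are fractions (0.01 = 1%), and a stage with `wait_minutes` of 0 is treated as the final promotion with no observation window.

```python
import time

STAGES = [
    {"weight": 1,   "wait_minutes": 5,  "error_threshold": 0.01},
    {"weight": 5,   "wait_minutes": 10, "error_threshold": 0.01},
    {"weight": 25,  "wait_minutes": 30, "error_threshold": 0.005},
    {"weight": 100, "wait_minutes": 0,  "error_threshold": 0},
]


def run_canary(stages, set_weight, get_error_rate, sleep=time.sleep):
    """Advance through the stages; return False if the rollout was rolled back."""
    for stage in stages:
        set_weight(stage["weight"])
        if stage["wait_minutes"] == 0:
            continue  # assumption: zero wait means nothing to observe
        sleep(stage["wait_minutes"] * 60)
        if get_error_rate() > stage["error_threshold"]:
            set_weight(0)  # send all traffic back to the old version
            return False
    return True
```

Passing `sleep` as a parameter keeps the loop testable: a test can substitute a no-op instead of waiting through the real observation windows.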

Rolling Deployment

Replace instances one batch at a time (e.g., 10% of fleet at once). Slower than blue/green, less infrastructure, but mixed versions run concurrently — requires backward compatibility.
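
Batch-by-batch replacement can be sketched as below. The helper names are hypothetical, and `update` and `healthy` are assumed callables that respectively install the new version on an instance and report whether it is serving correctly.

```python
import math


def rolling_batches(instances, batch_fraction=0.10):
    """Split the fleet into batches of roughly batch_fraction of its size."""
    size = max(1, math.ceil(len(instances) * batch_fraction))
    return [instances[i:i + size] for i in range(0, len(instances), size)]


def rolling_deploy(instances, update, healthy, batch_fraction=0.10):
    """Update one batch at a time; halt as soon as a batch is unhealthy.

    Returns (succeeded, instances that were updated).
    """
    updated = []
    for batch in rolling_batches(instances, batch_fraction):
        for inst in batch:
            update(inst)
            updated.append(inst)
        if not all(healthy(inst) for inst in batch):
            return False, updated  # the rest of the fleet stays on the old version
    return True, updated
```

When the deploy halts midway, `updated` lists exactly the instances that need to be reverted, which is also the window in which both versions serve traffic.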

Build System Design

  • Build workers: ephemeral containers, auto-scaled from a pool. Each build gets a fresh isolated environment.
  • Build caching: cache Docker layers and npm/pip dependencies, keyed by content hash. Cache key = hash(package.json) or hash(requirements.txt), so the cache is invalidated only when dependencies change. A cache hit can cut a build from around 5 minutes to around 30 seconds.
  • Parallelism: fan-out test suites across multiple workers, merge results. Large test suites (10,000+ tests) run in parallel shards.
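  • Build queue: Kafka or SQS. Builds wait in the queue until a worker is free; main-branch builds take priority over feature-branch builds. Neither Kafka nor standard SQS reorders messages by priority, so this usually means a separate queue or topic per priority level.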
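
Three of the ideas in this list (content-hashed cache keys, test sharding, and main-branch priority) can be sketched in a few lines. All names here are hypothetical, and the in-memory heap merely stands in for whatever queueing system is actually used.

```python
import hashlib
import heapq
import itertools


def cache_key(manifest_bytes, prefix="deps"):
    """Derive a cache key from the dependency manifest's contents."""
    digest = hashlib.sha256(manifest_bytes).hexdigest()[:16]
    return f"{prefix}-{digest}"


def shard_tests(tests, num_shards):
    """Deal tests round-robin into num_shards lists, in a stable order."""
    ordered = sorted(tests)
    return [ordered[i::num_shards] for i in range(num_shards)]


class BuildQueue:
    """Main-branch builds are served first; ties are first-in, first-out."""

    def __init__(self):
        self._heap = []
        self._counter = itertools.count()  # preserves FIFO order within a priority

    def push(self, branch, commit):
        priority = 0 if branch == "main" else 1
        heapq.heappush(self._heap, (priority, next(self._counter), branch, commit))

    def pop(self):
        _, _, branch, commit = heapq.heappop(self._heap)
        return branch, commit
```

Round-robin sharding ignores test duration; a real system would more likely balance shards using timing data from earlier runs.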

Artifact Management

  • Tag every artifact with git commit SHA, branch, and build timestamp
  • Immutable artifacts: never overwrite — create new artifact per build
  • Retention policy: keep last N successful builds per branch; keep all production deploys for 90 days
  • Artifact signing: sign Docker images or tarballs to prevent tampered deployments
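
The tagging, immutability, and signing rules above can be illustrated with a small in-memory store. This is a sketch under loud assumptions: `ArtifactStore` and `artifact_tag` are hypothetical, and the HMAC with a shared key is only a stand-in for real artifact signing, which would normally use asymmetric keys and a dedicated signing tool.

```python
import hashlib
import hmac
from datetime import datetime, timezone


def artifact_tag(commit_sha, branch, built_at):
    """Build a tag from branch, short commit SHA, and UTC build timestamp."""
    safe_branch = branch.replace("/", "-")
    stamp = built_at.astimezone(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    return f"{safe_branch}-{commit_sha[:12]}-{stamp}"


class ArtifactStore:
    """Write-once store that signs artifacts on upload (HMAC as a stand-in)."""

    def __init__(self, signing_key):
        self._key = signing_key
        self._blobs = {}

    def put(self, tag, data):
        if tag in self._blobs:
            raise ValueError(f"artifact {tag!r} already exists and is immutable")
        signature = hmac.new(self._key, data, hashlib.sha256).hexdigest()
        self._blobs[tag] = (data, signature)
        return signature

    def get_verified(self, tag):
        """Return the artifact only if its stored signature still matches."""
        data, signature = self._blobs[tag]
        expected = hmac.new(self._key, data, hashlib.sha256).hexdigest()
        if not hmac.compare_digest(signature, expected):
            raise ValueError(f"signature mismatch for {tag!r}")
        return data
```

Verifying at download time means a deploy host refuses an artifact that was modified after the build uploaded it.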

Rollback Mechanism

  • Fast rollback: keep previous artifact version ready; switch load balancer or Kubernetes deployment back in < 60 seconds
  • Automatic rollback triggers: error rate > threshold, P99 latency spike, health check failures after deploy
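  • Database migration rollback: the hardest part, because rolling back the application does not undo a schema change. Always make migrations backward-compatible (add new columns before removing old ones) so the previous application version still runs against the new schema. Track the applied migration version in the database.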
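
The backward-compatible migration rule can be shown with SQLite from the Python standard library. The table, column names, and `schema_version` layout are made up for illustration; the point is that migration 2 only adds a column, so code written against version 1 keeps working if the application is rolled back.

```python
import sqlite3

MIGRATIONS = {
    1: "CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)",
    # Expand step: add the new column and leave the old one in place.
    2: "ALTER TABLE users ADD COLUMN display_name TEXT",
    # A later contract step (dropping `name`) would ship only once no
    # deployed application version still reads that column.
}


def current_version(conn):
    """Read the highest applied migration version (0 if none)."""
    conn.execute("CREATE TABLE IF NOT EXISTS schema_version (version INTEGER NOT NULL)")
    row = conn.execute("SELECT MAX(version) FROM schema_version").fetchone()
    return row[0] or 0


def migrate(conn, target):
    """Apply pending migrations up to target and record each one."""
    for version in range(current_version(conn) + 1, target + 1):
        conn.execute(MIGRATIONS[version])
        conn.execute("INSERT INTO schema_version (version) VALUES (?)", (version,))
    conn.commit()
    return current_version(conn)
```

Calling `migrate` again with the same target is a no-op, because the recorded version tells it nothing is pending.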

Observability

  • Build logs: stream to centralized log store (Elasticsearch), retained for 30 days
  • Deployment events: publish deploy start/success/failure events to an event bus; subscribers post Slack notifications and page on-call through PagerDuty on failure
  • Deploy dashboard: current version per service, recent deploy history, rollback button
  • Metrics: build duration P50/P95/P99, build success rate, deploy frequency (DORA metric)
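
The build metrics in the last bullet could be computed from raw build records roughly as follows. The record shape (`duration_s`, `ok`) is an assumption for this sketch, and the percentile uses the simple nearest-rank method rather than interpolation.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list; pct is in (0, 100]."""
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def build_summary(builds):
    """Summarize build durations and success rate from build records."""
    durations = [b["duration_s"] for b in builds]
    return {
        "p50_s": percentile(durations, 50),
        "p95_s": percentile(durations, 95),
        "p99_s": percentile(durations, 99),
        "success_rate": sum(1 for b in builds if b["ok"]) / len(builds),
    }
```

Tracking P95 and P99 alongside the median matters because a handful of very slow builds can hide behind a healthy-looking P50.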

Interview Checklist

  • Draw the full pipeline: push → build → test → artifact → staging → production
  • Explain build caching and parallelism for fast builds
  • Compare blue/green vs canary vs rolling; know when to use each
  • Address rollback: both application-level and database migration rollback
  • Mention DORA metrics: deployment frequency, lead time for changes, mean time to restore (MTTR), change failure rate
