System Design: Design a Code Deployment System (CI/CD Pipeline)
Designing a CI/CD (Continuous Integration / Continuous Deployment) system is a common interview question at infrastructure-focused companies such as Cloudflare, Stripe, and Atlassian. The challenge is to build a reliable, fast pipeline with rollback support that can deploy code to thousands of servers safely.
Requirements
Functional: trigger build on code push, run tests, build artifacts, deploy to staging → production, support rollback, provide deployment status and logs.
Non-functional: fast builds (target < 10 min), reliable (no partial deploys), safe (gradual rollout, auto-rollback on errors), auditable (who deployed what, when).
Pipeline Stages
Code Push (git push)
       │
       ▼ webhook
┌──────────────┐
│ CI Server    │ (GitHub Actions, Jenkins, Buildkite)
│ - clone      │
│ - install    │
│ - lint/test  │
│ - build      │
└──────┬───────┘
       │ artifact (Docker image, .tar.gz)
       ▼
┌──────────────────┐
│ Artifact Store   │ (S3, ECR, Artifactory)
└──────┬───────────┘
       │
       ▼
┌──────────────────┐
│ Staging Deploy   │ (smoke tests)
└──────┬───────────┘
       │ approval gate (manual or auto)
       ▼
┌──────────────────────────────────┐
│ Production Deploy                │
│ - Blue/Green OR Canary rollout   │
│ - Health checks post-deploy      │
│ - Auto-rollback on error spike   │
└──────────────────────────────────┘
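The flow in the diagram can be sketched as an ordered list of stages that stops at the first failure, so a broken build never reaches staging and a failed smoke test never reaches production. This is a minimal illustration, not a real CI server: `Stage` and `run_pipeline` are hypothetical names, and the lambdas stand in for real steps such as cloning, testing, and deploying.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Stage:
    """One pipeline step; `run` returns True on success (hypothetical interface)."""
    name: str
    run: Callable[[], bool]


def run_pipeline(stages: List[Stage]) -> Tuple[bool, List[str]]:
    """Run stages in order and stop at the first failure.

    Returns (succeeded, names of the stages that completed successfully).
    """
    completed: List[str] = []
    for stage in stages:
        if not stage.run():
            return False, completed
        completed.append(stage.name)
    return True, completed


# Stand-in stages: real ones would call git, a test runner, a deploy tool, etc.
demo = [
    Stage("clone", lambda: True),
    Stage("lint/test", lambda: True),
    Stage("build", lambda: True),
    Stage("staging deploy", lambda: False),  # simulate a failed smoke test
    Stage("production deploy", lambda: True),
]
ok, completed = run_pipeline(demo)
print(ok, completed)
```

Because the staging stage fails in this demo, the production stage is never invoked.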
Deployment Strategies
Blue/Green Deployment
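Maintain two identical environments (blue = current, green = new). Deploy to green, run smoke tests, then switch load balancer traffic to green. Rollback means switching traffic back to blue, which takes effect almost immediately because blue is still running the old version. The cost is temporarily doubled infrastructure, since both environments run at the same time.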
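As a rough sketch of the cutover logic, the class below models the load balancer as a single pointer to the live environment. `BlueGreenRouter`, `install`, and `smoke_test` are hypothetical names chosen for illustration; a real system would call a load balancer or DNS API at the point where `self.live` is reassigned.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class BlueGreenRouter:
    """Tracks which environment receives live traffic (stand-in for a load balancer)."""
    live: str = "blue"

    @property
    def idle(self) -> str:
        return "green" if self.live == "blue" else "blue"

    def deploy(self, version: str,
               install: Callable[[str, str], None],
               smoke_test: Callable[[str], bool]) -> bool:
        """Install `version` on the idle environment; cut over only if smoke tests pass."""
        target = self.idle
        install(target, version)
        if not smoke_test(target):
            return False      # live traffic never reached the new version
        self.live = target    # the cutover is a single pointer flip
        return True

    def rollback(self) -> None:
        """Point traffic back at the other environment, which still runs the old version."""
        self.live = self.idle


installed: Dict[str, str] = {"blue": "v1"}
router = BlueGreenRouter()
switched = router.deploy("v2", installed.__setitem__, lambda env: True)
print(switched, router.live, installed)
```

Keeping the old environment untouched until the next deploy is what makes rollback a pointer flip rather than a redeploy.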
Canary Deployment
Route 1-5% of traffic to the new version. Monitor error rates and latency. Gradually increase to 10%, 25%, 50%, 100%. Auto-rollback if the error rate exceeds a threshold. Facebook and Google use this approach for gradual feature releases.
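# Canary rollout progression
# weight: percent of traffic sent to the new version
# wait_minutes: observation window before promoting to the next stage
# error_threshold: maximum tolerated error rate as a fraction (0.01 = 1%)
stages = [
    {"weight": 1, "wait_minutes": 5, "error_threshold": 0.01},
    {"weight": 5, "wait_minutes": 10, "error_threshold": 0.01},
    {"weight": 25, "wait_minutes": 30, "error_threshold": 0.005},
    {"weight": 100, "wait_minutes": 0, "error_threshold": 0},  # full rollout, nothing left to promote
]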
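One way to drive such a stage list is a loop that shifts traffic, observes the error rate for the stage's window, and either promotes or rolls back. The sketch below reuses the first three stages from the list above and treats the final 100% entry as the promotion step once they pass. `run_canary`, `set_weight`, and `observe_error_rate` are hypothetical names; in a real system `observe_error_rate` would wait for the window to elapse and query a metrics store.

```python
from typing import Callable, Dict, List

STAGES: List[Dict[str, float]] = [
    {"weight": 1, "wait_minutes": 5, "error_threshold": 0.01},
    {"weight": 5, "wait_minutes": 10, "error_threshold": 0.01},
    {"weight": 25, "wait_minutes": 30, "error_threshold": 0.005},
]


def run_canary(stages: List[Dict[str, float]],
               set_weight: Callable[[float], None],
               observe_error_rate: Callable[[float], float]) -> bool:
    """Walk the stages; roll back to 0% on the first threshold breach.

    `observe_error_rate(wait_minutes)` is assumed to block for the observation
    window and return the canary's error rate as a fraction.
    """
    for stage in stages:
        set_weight(stage["weight"])
        error_rate = observe_error_rate(stage["wait_minutes"])
        if error_rate > stage["error_threshold"]:
            set_weight(0)   # auto-rollback: remove all traffic from the canary
            return False
    set_weight(100)         # every stage passed: promote to full traffic
    return True


# Healthy rollout: the fake observer always reports a 0.2% error rate.
applied: List[float] = []
healthy = run_canary(STAGES, applied.append, lambda wait: 0.002)

# Unhealthy rollout: the second stage reports 2% errors against a 1% threshold.
rates = iter([0.001, 0.02])
applied_bad: List[float] = []
failed = run_canary(STAGES, applied_bad.append, lambda wait: next(rates))
print(healthy, applied, failed, applied_bad)
```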
Rolling Deployment
Replace instances one batch at a time (e.g., 10% of the fleet at once). This is slower than blue/green and needs less extra infrastructure, but old and new versions serve traffic side by side during the rollout, so each release must be backward-compatible with the previous one.
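The batching can be sketched as follows. This is an illustrative outline under simple assumptions: `split_batches` and `rolling_deploy` are hypothetical names, `upgrade` and `healthy` stand in for real provisioning and health-check calls, and the rollout simply halts at the first unhealthy batch rather than reverting earlier ones.

```python
from typing import Callable, List, Sequence


def split_batches(instances: Sequence[str], percent: int) -> List[List[str]]:
    """Split the fleet into batches of `percent` of its size (at least one instance each)."""
    if not 1 <= percent <= 100:
        raise ValueError("percent must be between 1 and 100")
    size = max(1, (len(instances) * percent + 99) // 100)  # integer ceiling division
    return [list(instances[i:i + size]) for i in range(0, len(instances), size)]


def rolling_deploy(instances: Sequence[str], percent: int,
                   upgrade: Callable[[str], None],
                   healthy: Callable[[str], bool]) -> List[str]:
    """Upgrade batch by batch and stop after the first batch that fails its health check.

    Returns the instances that were upgraded, including the failing batch.
    """
    upgraded: List[str] = []
    for batch in split_batches(instances, percent):
        for instance in batch:
            upgrade(instance)
        upgraded.extend(batch)
        if not all(healthy(instance) for instance in batch):
            break  # halt the rollout; the remaining instances keep the old version
    return upgraded


fleet = [f"web-{i}" for i in range(10)]
touched = rolling_deploy(fleet, 30, lambda inst: None, lambda inst: inst != "web-4")
print(touched)
```

In the demo, the second batch contains the unhealthy `web-4`, so the rollout stops with part of the fleet still on the old version, which is exactly the mixed-version state that makes backward compatibility a requirement.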
Build System Design
- Build workers: ephemeral containers, auto-scaled from a pool. Each build gets a fresh isolated environment.
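- Build caching: cache Docker layers and npm/pip dependencies by content hash. Cache key = hash(package.json) or hash(requirements.txt), or preferably the hash of a lockfile such as package-lock.json, which pins exact versions. A cache hit can cut a build from around 5 minutes to around 30 seconds.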
- Parallelism: fan-out test suites across multiple workers, merge results. Large test suites (10,000+ tests) run in parallel shards.
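- Build queue: Kafka or SQS. Builds wait in the queue until a worker is free, and main-branch builds take priority over feature-branch builds. Neither Kafka nor SQS provides per-message priority, so this is typically done with a separate queue or topic per priority level.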
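The caching, sharding, and prioritization ideas above can be sketched with the standard library alone. Everything here is illustrative: `cache_key`, `shard_tests`, and `BuildQueue` are hypothetical names, the queue is an in-memory stand-in for Kafka or SQS, and round-robin sharding ignores test duration, which a real scheduler would probably balance on.

```python
import hashlib
import heapq
import itertools
from typing import List, Tuple


def cache_key(manifest_bytes: bytes, prefix: str = "deps") -> str:
    """Derive a cache key from the dependency manifest's content."""
    digest = hashlib.sha256(manifest_bytes).hexdigest()
    return f"{prefix}-{digest[:16]}"


def shard_tests(tests: List[str], shard_count: int) -> List[List[str]]:
    """Assign sorted test names round-robin so every worker derives the same split."""
    ordered = sorted(tests)
    return [ordered[i::shard_count] for i in range(shard_count)]


class BuildQueue:
    """In-memory priority queue: main-branch builds first, FIFO within a priority."""

    def __init__(self) -> None:
        self._heap: List[Tuple[int, int, str, str]] = []
        self._counter = itertools.count()  # tie-breaker that preserves arrival order

    def push(self, branch: str, commit: str) -> None:
        priority = 0 if branch == "main" else 1
        heapq.heappush(self._heap, (priority, next(self._counter), branch, commit))

    def pop(self) -> Tuple[str, str]:
        _, _, branch, commit = heapq.heappop(self._heap)
        return branch, commit


queue = BuildQueue()
queue.push("feature/login", "a1b2c3")
queue.push("main", "d4e5f6")
queue.push("feature/search", "0a1b2c")
first = queue.pop()
print(first, cache_key(b'{"dependencies": {}}'), shard_tests(["t3", "t1", "t2"], 2))
```

Because the key depends only on the manifest's bytes, any change to the dependency list produces a new key and therefore a cache miss, while unrelated source changes keep hitting the cache.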
Artifact Management
- Tag every artifact with git commit SHA, branch, and build timestamp
- Immutable artifacts: never overwrite an existing artifact; every build produces a new one
- Retention policy: keep last N successful builds per branch; keep all production deploys for 90 days
- Artifact signing: sign Docker images or tarballs to prevent tampered deployments
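A tagging scheme and retention rule along these lines might look like the sketch below. The `Artifact` class, the tag layout, and `artifacts_to_keep` are hypothetical; the sketch assumes every artifact passed in came from a successful build, that timestamps are in UTC, and it measures the 90-day production window from build time because it does not track a separate deploy time.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Dict, List, Set


@dataclass(frozen=True)
class Artifact:
    commit_sha: str
    branch: str
    built_at: datetime            # assumed to be UTC
    deployed_to_prod: bool = False

    @property
    def tag(self) -> str:
        """Branch, build timestamp, and short commit SHA in one immutable tag."""
        safe_branch = self.branch.replace("/", "-")
        return f"{safe_branch}-{self.built_at:%Y%m%dT%H%M%SZ}-{self.commit_sha[:12]}"


def artifacts_to_keep(artifacts: List[Artifact], now: datetime,
                      keep_last: int = 5, prod_days: int = 90) -> Set[Artifact]:
    """Keep the newest `keep_last` builds per branch plus recent production artifacts."""
    keep: Set[Artifact] = set()
    by_branch: Dict[str, List[Artifact]] = {}
    for artifact in artifacts:
        by_branch.setdefault(artifact.branch, []).append(artifact)
    for items in by_branch.values():
        items.sort(key=lambda a: a.built_at, reverse=True)
        keep.update(items[:keep_last])
    cutoff = now - timedelta(days=prod_days)
    keep.update(a for a in artifacts if a.deployed_to_prod and a.built_at >= cutoff)
    return keep


build = Artifact("0123456789abcdef0123", "feature/login",
                 datetime(2024, 1, 2, 3, 4, 5, tzinfo=timezone.utc))
print(build.tag)
```

Anything not returned by `artifacts_to_keep` is a candidate for deletion; the artifacts themselves are never modified in place.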
Rollback Mechanism
- Fast rollback: keep previous artifact version ready; switch load balancer or Kubernetes deployment back in < 60 seconds
- Automatic rollback triggers: error rate > threshold, P99 latency spike, health check failures after deploy
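- Database migration rollback: the hardest part, because a schema change cannot be undone by swapping artifacts. Always make migrations backward-compatible (add new columns before removing old ones) so the previous application version still runs against the new schema. Track the applied migration version in the database.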
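The backward-compatible migration idea can be illustrated with SQLite from the standard library. This is a sketch under stated assumptions: the `users` table, the `display_name` column, the `schema_version` table, and the `migrate` helper are all invented for the example. Each step only adds to the schema, so code written against the old schema keeps working after the migration runs.

```python
import sqlite3

# (version, SQL) pairs. Every step is additive; dropping the old column would be a
# later migration, applied only once no running version reads it.
MIGRATIONS = [
    (1, "CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)"),
    (2, "ALTER TABLE users ADD COLUMN display_name TEXT"),
    (3, "UPDATE users SET display_name = name WHERE display_name IS NULL"),
]


def current_version(conn: sqlite3.Connection) -> int:
    """Read the highest applied migration version (0 if none)."""
    conn.execute("CREATE TABLE IF NOT EXISTS schema_version (version INTEGER NOT NULL)")
    row = conn.execute("SELECT MAX(version) FROM schema_version").fetchone()
    return row[0] or 0


def migrate(conn: sqlite3.Connection, target: int) -> int:
    """Apply pending migrations up to `target` and return the resulting version."""
    applied = current_version(conn)
    for version, sql in MIGRATIONS:
        if applied < version <= target:
            with conn:  # commits on success, rolls back if an exception is raised
                conn.execute(sql)
                conn.execute("INSERT INTO schema_version (version) VALUES (?)", (version,))
            applied = version
    return applied


conn = sqlite3.connect(":memory:")
migrate(conn, target=1)
conn.execute("INSERT INTO users (name) VALUES ('ada')")
migrate(conn, target=3)
old_reader = conn.execute("SELECT name FROM users").fetchone()          # old app version
new_reader = conn.execute("SELECT display_name FROM users").fetchone()  # new app version
print(old_reader, new_reader, current_version(conn))
```

Because the old query still succeeds after the migration, rolling the application back does not require rolling the schema back.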
Observability
- Build logs: stream to centralized log store (Elasticsearch), retained for 30 days
- Deployment events: emit deploy start/success/failure events to an event bus that fans out to Slack notifications and PagerDuty alerts
- Deploy dashboard: current version per service, recent deploy history, rollback button
- Metrics: build duration P50/P95/P99, build success rate, deploy frequency (DORA metric)
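The build metrics listed above can be computed from raw build records with the standard library. The function below is a small sketch: `build_metrics` is a hypothetical name, and the inclusive quantile method is one reasonable choice among several percentile definitions.

```python
import statistics
from typing import Dict, List


def build_metrics(durations_sec: List[float], succeeded: List[bool]) -> Dict[str, float]:
    """Summarize build durations (P50/P95/P99) and the build success rate.

    Requires at least two duration samples.
    """
    # n=100 yields 99 cut points; index k-1 is the k-th percentile.
    cuts = statistics.quantiles(durations_sec, n=100, method="inclusive")
    return {
        "p50": cuts[49],
        "p95": cuts[94],
        "p99": cuts[98],
        "success_rate": sum(succeeded) / len(succeeded),
    }


durations = [float(seconds) for seconds in range(1, 102)]  # 1s .. 101s
outcomes = [True] * 9 + [False]
metrics = build_metrics(durations, outcomes)
print(metrics)
```

Tracking P95 and P99 alongside the median matters because a handful of slow builds can dominate developer wait time even when the typical build is fast.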
Interview Checklist
- Draw the full pipeline: push → build → test → artifact → staging → production
- Explain build caching and parallelism for fast builds
- Compare blue/green vs canary vs rolling; know when to use each
- Address rollback: both application-level and database migration rollback
- Mention DORA metrics: deployment frequency, lead time for changes, time to restore service (MTTR), change failure rate