System Design Interview: Microservices and Service Mesh (Envoy, Istio, mTLS)

What Is a Service Mesh?

A service mesh is an infrastructure layer that handles service-to-service communication in a microservices architecture. Instead of embedding networking logic (retries, mTLS, circuit breaking, tracing) in each service, a service mesh externalizes it into sidecar proxies deployed alongside every service instance. Envoy is the dominant sidecar proxy and is the one Istio deploys; Linkerd ships its own lightweight proxy. Istio and Linkerd are the most common meshes, each with a control plane that manages the sidecar fleet.

Why Microservices Fail Without a Mesh

    • Duplicate code: every team reimplements retry logic, timeouts, circuit breakers
    • No mTLS: east-west traffic between services is unencrypted and unauthenticated
    • Observability gaps: no uniform distributed tracing or request metrics without instrumentation in every service
    • Service discovery coupling: services hardcode addresses or depend on a shared client library

    Sidecar Proxy Architecture

    Each pod runs two containers: the application container and an Envoy sidecar. iptables rules intercept all inbound and outbound traffic and redirect it through Envoy. The application connects to localhost; Envoy handles everything else: load balancing, retries, circuit breaking, mTLS termination, and emitting metrics/traces.

    The control plane (Istio’s istiod) pushes configuration to all Envoy sidecars via the xDS protocol. Service discovery data, routing rules, and rotated certificates all flow through this control plane channel.
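As a minimal sketch of how the sidecar ends up in each pod: with Istio's automatic injection, labeling a namespace is typically enough for new pods created in it to receive the Envoy container. The namespace name `shop` is a made-up example, not something from a real cluster.

```yaml
# Hypothetical namespace; the name "shop" is illustrative only.
apiVersion: v1
kind: Namespace
metadata:
  name: shop
  labels:
    # Tells Istio's injection webhook to add the Envoy sidecar
    # (and the traffic-redirect setup) to new pods in this namespace.
    istio-injection: enabled
```

Pods that were already running before the label was applied keep their old shape until they are restarted.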

    Service Discovery

    Two models:

    • Client-side discovery: the client queries a service registry (Consul, Eureka) and picks an instance. Simple, but discovery logic is in every client.
    • Server-side discovery: the client sends to a load balancer/proxy (Envoy, AWS ALB). The proxy queries the registry and routes. Client is dumb — no discovery SDK needed.

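    A service mesh uses server-side discovery via the sidecar: the application sends to a service name, and the local Envoy picks the instance using endpoint data pushed by the control plane. Kubernetes uses DNS + kube-proxy for basic discovery; with Istio, the sidecar load-balances directly to pod endpoints instead of relying on kube-proxy, which is what enables advanced routing policies.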
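To make the Kubernetes side concrete, here is a small sketch of the Service object that gives a workload its stable DNS name. The names `payments` and `shop`, the label, and the port numbers are assumptions chosen for illustration.

```yaml
# Hypothetical Service; names, labels, and ports are illustrative.
apiVersion: v1
kind: Service
metadata:
  name: payments
  namespace: shop
spec:
  selector:
    app: payments        # pods with this label become the endpoints
  ports:
  - name: http
    port: 80             # port callers connect to
    targetPort: 8080     # port the application container listens on
```

Callers address the service as `payments.shop.svc.cluster.local` (or just `payments` from inside the same namespace). The client needs no discovery SDK; whichever proxy sits on the path resolves that name to a healthy pod.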

    mTLS — Mutual TLS

    Every service gets a short-lived X.509 certificate provisioned by the control plane. The certificate carries a SPIFFE-format identity; in Istio the issuing CA is istiod by default, and SPIRE is an alternative issuer. Envoy validates the peer certificate on every connection, so both sides authenticate. Benefits: (1) encryption of all east-west traffic, (2) strong service identity, so services can’t impersonate each other, (3) zero-trust networking, where firewall rules are no longer the only perimeter. Certificate rotation is automatic and transparent to the application.
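In Istio, the behavior described above is usually switched on and tightened with two resources. The following is a sketch under assumptions: the `shop` namespace, the `payments` label, and the `checkout` service account are hypothetical names.

```yaml
# Require mTLS for every workload in the (hypothetical) shop namespace.
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: default
  namespace: shop
spec:
  mtls:
    mode: STRICT          # reject plaintext connections
---
# Allow only the (hypothetical) checkout service account to call payments.
apiVersion: security.istio.io/v1beta1
kind: AuthorizationPolicy
metadata:
  name: payments-allow-checkout
  namespace: shop
spec:
  selector:
    matchLabels:
      app: payments
  action: ALLOW
  rules:
  - from:
    - source:
        # Identity from the peer certificate, written without the spiffe:// prefix.
        principals: ["cluster.local/ns/shop/sa/checkout"]
```

The first resource gives encryption and authentication; the second is what turns identity into access control, since mTLS alone proves who the caller is but does not decide whether the call is allowed.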

    Circuit Breaker in the Mesh

    Configured declaratively in an Istio DestinationRule, under trafficPolicy:

    outlierDetection:
      consecutive5xxErrors: 5       # eject after 5 consecutive 5xx responses
      interval: 10s                 # how often hosts are evaluated
      baseEjectionTime: 30s         # minimum ejection duration
      maxEjectionPercent: 50        # at most 50% of hosts ejected
    

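    Envoy ejects unhealthy endpoints from the load balancing pool, and traffic reroutes to healthy instances. A host stays ejected for baseEjectionTime multiplied by the number of times it has been ejected, then returns to the pool automatically. Unlike a classic circuit breaker, there is no half-open probe request: if the host is still failing, it is simply ejected again, for longer.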
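For context, here is a sketch of a complete DestinationRule that combines outlier detection with connection pool limits, which are the bulkhead half of circuit breaking. The host name and all numeric limits are illustrative assumptions, and the apiVersion may differ depending on the Istio release.

```yaml
# Hypothetical DestinationRule; host and limits are illustrative.
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: payments-circuit-breaker
  namespace: shop
spec:
  host: payments.shop.svc.cluster.local
  trafficPolicy:
    connectionPool:
      tcp:
        maxConnections: 100            # cap on TCP connections to the service
      http:
        http1MaxPendingRequests: 50    # cap on requests queued for a connection
        http2MaxRequests: 200          # cap on concurrent requests
    outlierDetection:
      consecutive5xxErrors: 5          # eject after 5 consecutive 5xx responses
      interval: 10s                    # how often hosts are evaluated
      baseEjectionTime: 30s            # minimum ejection duration
      maxEjectionPercent: 50           # never eject more than half the hosts
```

The connection pool limits fail fast when an upstream is saturated, while outlier detection removes individual bad hosts; the two mechanisms are independent and can be tuned separately.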

    Traffic Management

    • Canary deployment: weight 95% to v1, 5% to v2 — controlled via VirtualService
    • Header-based routing: route requests with X-User-Beta: true to v2
    • Fault injection: inject 10% HTTP 500s or 100ms delays for chaos testing
    • Retry policy: retry on 503, up to 3 times, 25ms retry interval
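The four policies above can be sketched in a single VirtualService plus the DestinationRule that defines the version subsets. Service names, the `version` labels, and the `perTryTimeout` value are assumptions for illustration, and the apiVersion may differ depending on the Istio release.

```yaml
# Hypothetical subsets: pods are assumed to carry a "version" label.
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: payments-versions
  namespace: shop
spec:
  host: payments
  subsets:
  - name: v1
    labels:
      version: v1
  - name: v2
    labels:
      version: v2
---
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: payments
  namespace: shop
spec:
  hosts:
  - payments
  http:
  # Header-based routing: beta users always go to v2.
  - match:
    - headers:
        x-user-beta:              # header names are matched in lowercase
          exact: "true"
    fault:                        # chaos testing, scoped to beta traffic only
      abort:
        percentage:
          value: 10
        httpStatus: 500
      delay:
        percentage:
          value: 10
        fixedDelay: 100ms
    route:
    - destination:
        host: payments
        subset: v2
  # Canary: everyone else is split 95/5 between v1 and v2.
  - route:
    - destination:
        host: payments
        subset: v1
      weight: 95
    - destination:
        host: payments
        subset: v2
      weight: 5
    retries:
      attempts: 3
      perTryTimeout: 2s           # assumed value, not from the text above
      retryOn: "503"
```

Rules are evaluated top to bottom, so the header match has to come before the catch-all weighted route. The 25ms retry interval is left to the proxy's backoff behavior rather than set in this manifest.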

    Observability

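    Envoy emits metrics (Prometheus), access logs (Loki/Splunk), and trace spans (Zipkin/Jaeger) for every request, with no metrics or logging code in the application. You get RED metrics (Rate, Errors, Duration) for every service-to-service call automatically. The one caveat is tracing: Envoy creates the spans, but the application must forward the incoming trace headers (for example x-request-id and the B3 or traceparent headers) on its outbound calls so that the spans join into a single trace.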
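In Istio, one way to tune this mesh-wide is the Telemetry resource. The sketch below assumes a recent Istio release, the built-in `envoy` access log provider, and a tracing backend already configured as the mesh default; the 10% sampling rate is an arbitrary example.

```yaml
# Mesh-wide defaults; placing it in the root namespace (assumed to be
# istio-system) applies it to every workload.
apiVersion: telemetry.istio.io/v1alpha1
kind: Telemetry
metadata:
  name: mesh-default
  namespace: istio-system
spec:
  accessLogging:
  - providers:
    - name: envoy                  # built-in Envoy access log provider
  tracing:
  - randomSamplingPercentage: 10.0 # trace roughly 1 in 10 requests
```

Sampling is the main cost lever here: metrics are cheap to keep at 100%, while full trace sampling on a busy mesh is usually too expensive to store.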

    Interview Framework

    1. How do services discover each other? Client-side vs. server-side vs. mesh.
    2. How is east-west traffic secured? mTLS via sidecar.
    3. How do you prevent cascade failures? Circuit breaker + bulkhead in Envoy.
    4. How do you deploy changes safely? Canary via weighted routing in VirtualService.
    5. How do you observe distributed requests? Trace spans emitted by Envoy, with each service forwarding the trace headers.

    Frequently Asked Questions

    What is the difference between a service mesh and an API gateway?

    An API gateway handles north-south traffic (external clients to internal services): authentication, rate limiting, SSL termination, routing, API versioning. It is an entry point, one per cluster or per API surface. A service mesh handles east-west traffic (service to service inside the cluster): mTLS, retries, circuit breaking, distributed tracing, canary deployments. It operates transparently via sidecar proxies without changing application code. They complement each other: API gateway at the edge, service mesh inside the cluster. Kong and AWS API Gateway are typical for external traffic; Istio and Linkerd for internal. Confusion arises because both do routing and traffic management; the scope is what differs.

    How does Envoy implement circuit breaking and what are the configuration parameters?

    Envoy circuit breaking operates at two levels: connection pool limits and outlier detection. Connection pool limits are max_connections (TCP), max_pending_requests (HTTP/1), max_requests (HTTP/2), and max_retries; they prevent a single slow upstream from consuming all resources. Outlier detection, as exposed by Istio, uses consecutive5xxErrors (consecutive 5xx responses before ejection), consecutiveGatewayErrors (the same, but counting only 502, 503, and 504), interval (how often hosts are evaluated, e.g., 10s), baseEjectionTime (minimum ejection duration, e.g., 30s), and maxEjectionPercent (max fraction of hosts ejectable at once, e.g., 50%). When a host reaches the consecutive error threshold, it is ejected from the load balancing pool. It stays out for baseEjectionTime multiplied by the number of times it has been ejected, then rejoins automatically. There is no half-open probe request as in the classic circuit breaker pattern; a host that is still failing is ejected again.

    How does mTLS work in a service mesh and what identity system does it use?

    In a service mesh, every workload gets a cryptographic identity via SPIFFE (Secure Production Identity Framework For Everyone). The identity is encoded in an X.509 certificate as a SPIFFE URI: spiffe://cluster.local/ns/default/sa/payment-service. The control plane (Istio's istiod or SPIRE) acts as a certificate authority: it signs short-lived certificates (default 24h) for each service account. Envoy sidecars handle TLS handshakes, and both sides present and validate certificates. mTLS provides: (1) encryption of all traffic, (2) strong service identity, so that, combined with an authorization policy, payment-service accepts connections only from authorized callers rather than from any pod in the cluster, (3) zero-trust networking, where network-level firewall rules are no longer the only line of defense. Certificate rotation is automatic and transparent to the application process.
