System Design: Identity and Access Management — Authentication, Authorization, and Token Lifecycle

Core Responsibilities

An identity service (also called an IAM or auth service) has four core responsibilities:

Authentication (who are you?): verifying credentials and issuing tokens.
Authorization (what can you do?): checking permissions for specific operations.
Token lifecycle: issuing, refreshing, and revoking tokens.
User management: account creation, password management, and MFA.

At companies like Stripe or Cloudflare, the identity service is a foundational internal service that every other service depends on for request authentication and authorization.

Authentication Flow (JWT + Refresh Tokens)

Login: the user submits credentials (email and password). The server looks up the stored hash (SELECT password_hash FROM users WHERE email = :e) and checks the submitted password against it with bcrypt (bcrypt.verify(input, hash)). On success it issues two tokens:

(1) Access token (JWT): short-lived (15 minutes), stateless, and signed with the identity service's private key (RS256). It contains user_id, roles, issued_at, and expires_at.
(2) Refresh token: long-lived (30 days), an opaque random string stored server-side in the refresh_tokens table.

Client storage: the refresh token goes in an HttpOnly cookie, which JavaScript cannot read, so an XSS payload cannot exfiltrate it. The access token is kept in memory or in a non-persistent cookie.

Token refresh: when the access token expires, the client sends the refresh token. The server verifies it against the database and issues a new access token. Optionally it also rotates the refresh token: each use invalidates the old token and issues a new one, so a second presentation of an already-used token signals theft.

Token revocation: revoking the refresh token (on logout or on suspicious activity) prevents further access tokens from being issued. Because access tokens are validated statelessly, an already-issued one stays valid until it expires unless services also consult a revocation list. The short lifetime (15 minutes) limits the damage of a stolen access token.
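The refresh-token rotation described above can be sketched with the standard library alone. This is a minimal illustration, not a production design: the RefreshTokenStore class, the in-memory dictionary standing in for the refresh_tokens table, the choice to store only a SHA-256 digest of each token, and the "family" identifier used to revoke every token descended from one login are all assumptions added for the example.

```python
import hashlib
import secrets
import time
from dataclasses import dataclass

REFRESH_TTL_SECONDS = 30 * 24 * 3600  # 30 days, as in the text


@dataclass
class RefreshRecord:
    user_id: str
    family_id: str      # assumption: tokens descended from one login share a family
    expires_at: float
    used: bool = False


class RefreshTokenStore:
    """In-memory stand-in for the refresh_tokens table (hypothetical schema)."""

    def __init__(self, clock=time.time):
        self._rows = {}      # sha256(token) -> RefreshRecord
        self._clock = clock

    @staticmethod
    def _digest(token):
        # Assumption: store only a hash so a database leak exposes no usable tokens.
        return hashlib.sha256(token.encode()).hexdigest()

    def issue(self, user_id, family_id=None):
        token = secrets.token_urlsafe(32)  # opaque random string
        self._rows[self._digest(token)] = RefreshRecord(
            user_id=user_id,
            family_id=family_id or secrets.token_hex(8),
            expires_at=self._clock() + REFRESH_TTL_SECONDS,
        )
        return token

    def revoke_family(self, family_id):
        self._rows = {k: r for k, r in self._rows.items() if r.family_id != family_id}

    def rotate(self, token):
        """Exchange a refresh token for a new one; returns (user_id, new_token)."""
        record = self._rows.get(self._digest(token))
        if record is None or record.expires_at < self._clock():
            raise PermissionError("unknown or expired refresh token")
        if record.used:
            # An already-rotated token was presented again: treat it as theft
            # and revoke every token that came from the same login.
            self.revoke_family(record.family_id)
            raise PermissionError("refresh token reuse detected")
        record.used = True
        return record.user_id, self.issue(record.user_id, record.family_id)
```

In this sketch a used token is kept (marked used) rather than deleted, which is what lets the server tell "reused" apart from "never existed". On each successful rotate call the caller would also mint a fresh 15-minute access token for the returned user_id.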

Authorization: RBAC and ABAC

RBAC (Role-Based Access Control): users are assigned roles (admin, editor, viewer), and roles carry permissions (create_post, delete_user, read_analytics). A permission check asks two questions: does the user have role R, and does role R grant permission P? The data lives in three tables: roles, user_roles, and role_permissions. RBAC is simple and auditable, and it is used in most B2B SaaS products.

ABAC (Attribute-Based Access Control): permissions depend on attributes of the user, the resource, and the environment. Example: "a user can edit this document if they are the owner or if the document is shared with their team." As a policy: can_edit(user, doc) = doc.owner_id == user.id OR user.team_id IN doc.shared_teams. ABAC is more flexible than RBAC but harder to audit, because policies can become complex. It is used in Google Drive and in AWS IAM (policies with conditions).

For most products, start with RBAC and add ABAC-style conditions when RBAC becomes too coarse.
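The two checks can be put side by side in a few lines. The sketch below is illustrative only: the User and Document classes and the specific role-to-permission mapping are assumptions made for the example (the text names the roles and permissions but not which role grants which), and the dictionary stands in for the role_permissions table.

```python
from dataclasses import dataclass, field

# Hypothetical mapping standing in for the role_permissions table.
ROLE_PERMISSIONS = {
    "admin": {"create_post", "delete_user", "read_analytics"},
    "editor": {"create_post"},
    "viewer": {"read_analytics"},
}


@dataclass
class User:
    id: int
    team_id: int
    roles: set = field(default_factory=set)


@dataclass
class Document:
    owner_id: int
    shared_teams: set = field(default_factory=set)


def has_permission(user, permission):
    """RBAC: does any role held by the user grant this permission?"""
    return any(permission in ROLE_PERMISSIONS.get(role, set()) for role in user.roles)


def can_edit(user, doc):
    """ABAC: the policy from the text, evaluated on user and resource attributes."""
    return doc.owner_id == user.id or user.team_id in doc.shared_teams


alice = User(id=1, team_id=10, roles={"editor"})
doc = Document(owner_id=1, shared_teams={20})
print(has_permission(alice, "create_post"), can_edit(alice, doc))
```

The RBAC check never looks at the resource, which is why it is easy to audit; the ABAC check needs the document itself, which is where the extra flexibility and the extra complexity come from.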

Multi-Factor Authentication

MFA adds a second factor after password verification.

TOTP (Time-based One-Time Password): the user enrolls an authenticator app (Google Authenticator, Authy). The server generates a random 20-byte secret, encodes it as base32, and stores it encrypted per user. The user scans a QR code that encodes the secret. To verify, the server computes TOTP(secret, current_30s_window) and compares it to the user's input, allowing ±1 window for clock skew.

SMS OTP: generate a 6-digit code, store it with a 10-minute TTL, and send it via SMS. Onboarding is easier, but SMS is vulnerable to SIM swapping.

FIDO2/WebAuthn: a hardware key or a platform biometric (Face ID, fingerprint) answers a cryptographic challenge. The response is bound to the site's origin, which makes it phishing-resistant. It is the gold standard for high-security applications (financial, enterprise).

MFA bypass codes: generate 8 single-use recovery codes at MFA enrollment and store them hashed. A user who loses their MFA device can sign in with one; invalidate each code after use.
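The TOTP verification step can be written with the Python standard library. The sketch below assumes the common authenticator-app defaults (HMAC-SHA1, 6 digits, 30-second steps); the function names are chosen for the example, and encryption of the stored secret is left out.

```python
import base64
import hashlib
import hmac
import secrets
import struct
import time

STEP_SECONDS = 30
DIGITS = 6


def new_secret():
    """Random 20-byte secret, base32-encoded for the QR code."""
    return base64.b32encode(secrets.token_bytes(20)).decode()


def totp(secret_b32, counter):
    """One-time code for a given time-step counter (HOTP per RFC 4226, HMAC-SHA1)."""
    key = base64.b32decode(secret_b32)
    mac = hmac.new(key, struct.pack(">Q", counter), hashlib.sha1).digest()
    offset = mac[-1] & 0x0F  # dynamic truncation: low nibble of the last byte
    code = struct.unpack(">I", mac[offset:offset + 4])[0] & 0x7FFFFFFF
    return str(code % 10 ** DIGITS).zfill(DIGITS)


def verify(secret_b32, user_code, now=None, window=1):
    """Accept the current 30-second step plus or minus `window` steps of clock skew."""
    current = int((time.time() if now is None else now) // STEP_SECONDS)
    return any(
        hmac.compare_digest(totp(secret_b32, current + delta).encode(), user_code.encode())
        for delta in range(-window, window + 1)
    )
```

With the RFC 4226 test secret (the ASCII bytes 12345678901234567890), counter 1 yields 287082, which gives a quick way to check an implementation against the published vectors. A real deployment would also remember the last accepted step per user so that a code cannot be replayed inside its window.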

Token Storage and Security

Token storage on clients:

Access token in memory (a JavaScript variable): cleared on page refresh and never persisted. This reduces XSS exposure compared with localStorage, although script injected into the page can still make requests with the token while the page is open.
Refresh token in an HttpOnly cookie: not readable by JavaScript, and sent automatically by the browser with requests to the cookie's domain.
CSRF protection for the cookie-based refresh endpoint: require a custom header (X-Requested-With: XMLHttpRequest) or a CSRF token.
Never store tokens in localStorage: it is readable by any JavaScript on the page, so an XSS attack can steal them.

Service-to-service auth: internal services use short-lived JWTs signed with the identity service's private key, or mutual TLS (mTLS).

Public key distribution: services fetch the identity service's JWKS (JSON Web Key Set) endpoint, cache the result, and validate JWT signatures locally, so no round-trip to the identity service is needed per request. A typical JWKS cache TTL is 5-10 minutes. On key rotation, publish the new key alongside the old one (a dual-key period) so that tokens signed with either key validate while caches catch up.
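The JWKS caching behaviour can be sketched as a small class. Everything here is an assumption made for illustration: the JwksCache name, the injected fetch_jwks callable (which would wrap an HTTP GET of the JWKS endpoint in practice), and the policy of refetching once when an unknown key ID (kid) appears. Signature verification itself is out of scope.

```python
import time


class JwksCache:
    """Caches signing keys by key ID (kid) and refetches after a TTL."""

    def __init__(self, fetch_jwks, ttl_seconds=300, clock=time.monotonic):
        self._fetch = fetch_jwks  # hypothetical callable returning {"keys": [...]}
        self._ttl = ttl_seconds   # 5 minutes, the low end of the range in the text
        self._clock = clock
        self._keys = {}
        self._fetched_at = None

    def _refresh(self):
        jwks = self._fetch()
        self._keys = {key["kid"]: key for key in jwks["keys"]}
        self._fetched_at = self._clock()

    def get_key(self, kid):
        expired = (
            self._fetched_at is None
            or self._clock() - self._fetched_at > self._ttl
        )
        if expired or kid not in self._keys:
            # An unknown kid may mean the identity service rotated keys,
            # so refetch once before rejecting the token.
            self._refresh()
        if kid not in self._keys:
            raise KeyError(f"no signing key with kid={kid!r}")
        return self._keys[kid]
```

Refetching on an unknown kid lets a service pick up a newly published key before the TTL expires, which pairs naturally with the dual-key period. A hardened version would rate-limit those refetches so that tokens carrying made-up key IDs cannot be used to flood the identity service.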

Frequently Asked Questions

Why are access tokens short-lived while refresh tokens are long-lived?

Access tokens are sent on every API request, so they travel over the network frequently and have more opportunities to be intercepted or stolen. An attacker holding a stolen access token can impersonate the user, but the short lifetime (15 minutes) limits the damage window: without the refresh token, the stolen token becomes useless within 15 minutes. Refresh tokens are used less often (only to obtain a new access token) and are stored more securely (in an HttpOnly cookie). An attacker holding a stolen refresh token can obtain new access tokens until it is revoked, which is why refresh tokens are revoked on logout, on suspicious activity, and on password change. Revocation stops new access tokens from being issued; any access token already issued remains valid until it expires. The combination gives a good security/usability balance: users are not re-authenticated frequently (long-lived refresh token), but exposure from any single stolen access token is limited (short access token lifetime). The alternative is stateful sessions, where the server stores session state: these are easier to revoke but require a database lookup on every request. JWT access tokens need no server-side lookup, which suits high-throughput APIs.

What is PKCE and why is it required for public OAuth clients?

PKCE (Proof Key for Code Exchange) prevents authorization code interception attacks in OAuth. The problem: a native app (mobile, desktop) cannot keep a client secret or its redirect URI confidential, because the app binary is publicly inspectable. A malicious app could register the same redirect URI and intercept the authorization code. The PKCE solution: before the OAuth flow, the client generates a random code_verifier (43-128 characters) and computes code_challenge = base64url(SHA256(code_verifier)). The authorization request includes the code_challenge. The token exchange includes the original code_verifier. The authorization server verifies that base64url(SHA256(code_verifier)) equals the stored code_challenge. An intercepted authorization code is useless without the code_verifier, which is never sent through the redirect; only the challenge is sent in the authorization request. PKCE is now required for all public OAuth clients (RFC 9700, OAuth 2.1) and recommended for confidential clients too. In the SPA (single-page app) context, PKCE prevents auth code interception by injected browser extensions or other tabs.

How do you implement SSO (Single Sign-On) across multiple applications?

SSO allows a user to log in once and access multiple applications without re-authenticating. Protocols: SAML 2.0 (enterprise, XML-based) and OpenID Connect / OAuth 2.0 (modern, JSON/JWT). The OIDC SSO flow:

(1) The user tries to access App A. App A redirects to the Identity Provider (IdP).
(2) The user authenticates with the IdP (if not already logged in). The IdP issues an authorization code and redirects back to App A.
(3) App A exchanges the code for an ID token and an access token.
(4) The user navigates to App B. App B redirects to the IdP. The IdP finds the existing session (the SSO session cookie) and immediately issues tokens for App B without requiring re-authentication.

SSO session: the IdP maintains an SSO session server-side, not just a JWT. The SSO session has its own lifetime (for example, 8 hours for a workday), and applications have their own shorter session lifetimes. Session logout: single logout (SLO) propagates logout to all applications the user accessed via SSO. It is complex to implement, so most systems just expire all active tokens.

How do you prevent credential stuffing attacks?

Credential stuffing: attackers take username/password lists leaked in other breaches and try them against your login. Because many users reuse passwords across sites, this succeeds at scale. Defenses:

(1) Rate limiting: limit failed login attempts per IP (max 5 per minute) and per account (max 10 per hour), with exponential backoff for repeated failures.
(2) CAPTCHA: trigger after 3 failed attempts from the same IP or account.
(3) IP reputation: block known malicious IPs using threat intelligence feeds, a WAF, or a service like Cloudflare Bot Management.
(4) Device fingerprinting: when an account is accessed from a new device, send an email verification before granting access.
(5) Compromised password detection: check submitted passwords against HaveIBeenPwned's k-anonymity API, which takes only the first 5 characters of the password's SHA-1 hash, never the full hash. If the password appears in breach data, require the user to change it.
(6) Anomaly detection: an ML model flags login attempts with unusual velocity, geographic anomalies, or behavior inconsistent with the account's history. Alert the user and require MFA.

What is the difference between authentication and authorization?

Authentication (AuthN) verifies identity: "who are you?" The caller presents credentials (password, biometric, certificate), which are verified against a stored secret. The output is a verified identity claim (user_id, email). Examples: login with a password, OAuth login with Google, mTLS certificate validation. Authentication says nothing about what the authenticated entity is allowed to do.

Authorization (AuthZ) determines permissions: "what are you allowed to do?" It takes an authenticated identity and checks it against a policy (RBAC roles, ACLs, ABAC policies) for a specific action on a specific resource. Examples: can user 123 delete post 456? Can service A call the endpoint /admin/metrics? Authorization is checked on every protected operation, not just at login.

Order: always authenticate before authorizing, because without authentication there is no trustworthy identity to authorize. A common confusion in code: checking that a request carries a valid JWT (authentication) is not the same as checking that the JWT's subject has permission to perform the requested action (authorization). Both checks are needed.
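The PKCE exchange described in the FAQ above comes down to a few lines of hashing and encoding. The sketch below uses only the standard library; the function names are chosen for the example, and a real client would get the same values from its OAuth library.

```python
import base64
import hashlib
import hmac
import secrets


def make_code_verifier():
    """43-character URL-safe random string (32 random bytes), the minimum length."""
    return secrets.token_urlsafe(32)


def make_code_challenge(verifier):
    """code_challenge = base64url(SHA256(code_verifier)), without '=' padding."""
    digest = hashlib.sha256(verifier.encode("ascii")).digest()
    return base64.urlsafe_b64encode(digest).rstrip(b"=").decode("ascii")


def verify_pkce(verifier, stored_challenge):
    """Server side: recompute the challenge at token exchange and compare."""
    return hmac.compare_digest(
        make_code_challenge(verifier).encode("ascii"),
        stored_challenge.encode("ascii"),
    )


verifier = make_code_verifier()
challenge = make_code_challenge(verifier)   # sent with the authorization request
print(verify_pkce(verifier, challenge))     # checked at the token exchange
```

The client keeps the verifier to itself until the token exchange, so an app that only intercepts the redirect sees the authorization code but has nothing that hashes to the stored challenge.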

Asked at: Cloudflare Interview Guide

Asked at: Stripe Interview Guide

Asked at: Coinbase Interview Guide

Asked at: Airbnb Interview Guide
