System Design Interview: Design a Fraud Detection System

What Is a Fraud Detection System?

A fraud detection system identifies and blocks fraudulent transactions, account takeovers, and abuse in real time. Examples: Stripe Radar, PayPal fraud detection, Google reCAPTCHA. Core challenges: sub-100ms decisions on transactions, handling highly imbalanced data (fraud is 0.1% of transactions), and adapting to adversarial actors who continuously evolve their tactics.

    System Requirements

    Functional

    • Score each transaction in real time: allow, flag for review, or block
    • Account takeover detection: unusual login patterns
    • Rules engine: configurable business rules without code deploys
    • Case management: human review of flagged transactions
    • Model retraining pipeline as new fraud patterns emerge

    Non-Functional

    • 10K transactions/second, decision in <100ms
    • False positive rate <1% (legitimate transactions blocked)
    • High recall on fraud (miss as little fraud as possible)

    Decision Pipeline

    Transaction arrives
           │
      [Blocklist check] ── O(1), known bad cards/IPs → block
           │
      [Rules engine] ── configurable rules, O(ms) → allow/block/flag
           │
      [ML model] ── gradient boosting or neural net, O(10ms) → fraud score
           │
      [Decision] ── score threshold → allow/review queue/block
           │
      [Velocity checks] ── post-decision async update of counters
    
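The layered pipeline above can be sketched as a single decision function. This is an illustrative sketch, not a production implementation: the field names, thresholds, and the shape of `rules` and `model_score` are assumptions for the example.

```python
def decide(tx, blocklist, rules, model_score):
    """Layered fraud decision: cheapest checks first, ML model last."""
    # O(1) blocklist lookup on known-bad cards/IPs
    if tx["card_id"] in blocklist or tx["ip"] in blocklist:
        return "block"
    # Configurable rules can short-circuit before the model runs;
    # a rule returns "allow"/"block"/"flag" or None to fall through
    for rule in rules:
        verdict = rule(tx)
        if verdict in ("allow", "block", "flag"):
            return verdict
    # ML score thresholds (illustrative): block high, review middle, allow low
    score = model_score(tx)
    if score > 0.9:
        return "block"
    if score > 0.5:
        return "review"
    return "allow"
```

Ordering matters: the blocklist and rules handle obvious cases in microseconds, so the ~10ms model call only runs on the ambiguous remainder.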

    Feature Engineering

    Fraud models rely on real-time features computed at decision time:

    • Transaction features: amount, merchant category, currency, time of day
    • Velocity features: transactions in last 1/5/60 minutes for this card/account/IP
    • Historical features: average transaction amount for this user (last 30 days), most common merchant categories
    • Network features: how many accounts share this device fingerprint, this IP, this email domain
    • Behavioral features: typing speed, mouse movement entropy (bot vs human)

    Velocity features require real-time counters. Store them in Redis: INCR tx_count:{card_id}:{minute_bucket}, with a TTL slightly longer than the longest window (e.g., 75 minutes for a 60-minute window, so buckets survive until they fall out of every query). To get 5-minute velocity, sum the last 5 minute buckets.
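A minimal in-memory stand-in for this counter scheme (the Redis equivalents are noted in comments; the class and method names are ours, not a standard API):

```python
import time
from collections import defaultdict

class VelocityCounter:
    """In-memory sketch of the Redis scheme:
    INCR vel:{entity}:{minute_bucket} + EXPIRE, then sum the last N buckets."""

    def __init__(self):
        self.buckets = defaultdict(int)  # (entity_id, minute_bucket) -> count

    def record(self, entity_id, now=None):
        minute = int((now if now is not None else time.time()) // 60)
        # Redis: INCR vel:{entity_id}:{minute}; EXPIRE 75 minutes
        self.buckets[(entity_id, minute)] += 1

    def velocity(self, entity_id, window_minutes, now=None):
        minute = int((now if now is not None else time.time()) // 60)
        # Redis: MGET over the last `window_minutes` bucket keys, then sum
        return sum(self.buckets.get((entity_id, minute - i), 0)
                   for i in range(window_minutes))
```

The same counter is kept per card, account, email, IP, and device fingerprint, so one transaction triggers several `record` calls.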

    Rules Engine

    Rules are configured by fraud analysts without code deploys. Examples:

    • Block if: country = high-risk AND amount > $500 AND account_age < 7 days
    • Flag if: 3+ transactions in 10 minutes from different countries
    • Allow: if user has 2-year history AND verified bank account AND amount < $100

    Implement as a decision tree evaluated against the transaction feature vector. Store rules in a DB; load into memory on change (hot reload). Rules run before ML to handle obvious cases cheaply.
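A toy version of the three example rules above, expressed as data plus a first-match evaluator. In a real engine the conditions would live in a DB as a DSL (not Python lambdas) so analysts can edit them; the field names and the placeholder country set here are assumptions.

```python
HIGH_RISK_COUNTRIES = {"XX", "YY"}  # placeholder codes for illustration

# Each rule: (action, predicate over the transaction feature vector).
RULES = [
    ("block", lambda tx: tx["country"] in HIGH_RISK_COUNTRIES
                          and tx["amount"] > 500
                          and tx["account_age_days"] < 7),
    ("flag",  lambda tx: tx["tx_last_10min"] >= 3
                          and tx["countries_last_10min"] > 1),
    ("allow", lambda tx: tx["account_age_days"] >= 730
                          and tx["bank_verified"]
                          and tx["amount"] < 100),
]

def evaluate_rules(tx):
    """First matching rule wins; None means fall through to the ML model."""
    for action, predicate in RULES:
        if predicate(tx):
            return action
    return None
```

Because rules short-circuit, the obvious blocks and allows never pay the cost of a model call.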

    ML Model

    Gradient Boosted Decision Trees (XGBoost, LightGBM) work well for tabular fraud data: handle missing values, require no feature normalization, interpretable feature importances. Training data is highly imbalanced (1K fraud vs 999K legit). Techniques: oversampling fraud (SMOTE), undersampling legit, class_weight parameter. Evaluate with precision-recall curve (not accuracy — 99.9% accuracy by always predicting “legit” is meaningless). Optimize for F1 or business-specific cost function (cost of false positive = lost revenue, cost of false negative = fraud loss).
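The evaluation and cost-function logic can be made concrete with two small helpers (pure Python; the $5/$100 cost figures are illustrative assumptions, not universal constants):

```python
def pr_metrics(tp, fp, fn):
    """Precision, recall, F1 from confusion counts (true negatives unused)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

def business_cost(fp, fn, cost_fp=5.0, cost_fn=100.0):
    """Asymmetric cost: a false positive loses revenue, a false
    negative incurs the fraud loss (example dollar values)."""
    return fp * cost_fp + fn * cost_fn
```

With this cost function, a threshold that trades a few extra false positives for fewer false negatives usually wins, which is why fraud systems tune for recall rather than raw accuracy.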

    Feature Store

    Features must be consistent between training and serving. A feature store provides: offline features (batch computed, e.g., 30-day average amount per user) and online features (real-time computed, e.g., velocity in last 5 minutes). At serving time: fetch both from the feature store, concatenate, feed to model. This prevents training-serving skew — the most common ML production failure mode.
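A dict-based sketch of the serving-time merge (real feature stores like Feast expose a similar lookup; the store layout and feature names here are assumptions):

```python
def build_feature_vector(entity_ids, offline_store, online_store):
    """Merge batch (offline) and real-time (online) features under the
    SAME feature names used at training time -- the guard against skew."""
    features = {}
    # Offline: batch-computed, e.g. 30-day average amount per user
    features.update(offline_store.get(entity_ids["user_id"], {}))
    # Online: real-time, e.g. 5-minute velocity per card
    features.update(online_store.get(entity_ids["card_id"], {}))
    return features
```

The key property is that training reads the same named features from the same definitions, so the model never sees a feature computed two different ways.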

    Feedback Loop

    Fraud decisions generate labels: a blocked transaction later confirmed as fraud (true positive) or as legitimate (false positive). These labels feed back into the training pipeline. Weekly retraining cycle: collect the last 7 days of labeled decisions, retrain the model, run it in shadow mode against the production model (score live traffic without acting on the scores), and promote it if it performs better. Alert on model performance degradation: if recall drops below threshold, retrain immediately.
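The promotion gate at the end of the cycle can be as simple as a guarded comparison (thresholds here are illustrative assumptions):

```python
def should_promote(candidate, production, min_recall_gain=0.01,
                   max_precision_loss=0.005):
    """Promote the shadow-mode candidate only if recall improves
    meaningfully without giving up precision."""
    return (candidate["recall"] >= production["recall"] + min_recall_gain
            and candidate["precision"] >= production["precision"] - max_precision_loss)
```

The asymmetry (require a recall gain, merely bound the precision loss) mirrors the fraud cost structure: missed fraud is more expensive than an occasional extra review.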

    Account Takeover Detection

    Signals: login from new device/IP, login from different country than usual, password reset followed immediately by transaction, simultaneous sessions from different geolocations. On suspicious login: require step-up authentication (SMS OTP, email verification) before allowing transactions.
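The "simultaneous sessions from different geolocations" signal is often implemented as an impossible-travel check: compute the great-circle distance between two logins and flag if the implied speed exceeds what an airliner could cover. A sketch (the 900 km/h cutoff and the login record shape are assumptions):

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, in kilometers."""
    r = 6371.0  # mean Earth radius
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(p1) * math.cos(p2) * math.sin(dlam / 2) ** 2)
    return 2 * r * math.asin(math.sqrt(a))

def impossible_travel(login_a, login_b, max_speed_kmh=900.0):
    """True if the implied travel speed between two logins exceeds
    a plausible maximum (roughly airliner cruise speed)."""
    dist = haversine_km(login_a["lat"], login_a["lon"],
                        login_b["lat"], login_b["lon"])
    hours = abs(login_b["ts"] - login_a["ts"]) / 3600.0
    return hours > 0 and dist / hours > max_speed_kmh
```

A positive result would trigger the step-up authentication path rather than an outright block, since VPNs and shared accounts produce legitimate-looking "teleports."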

    Interview Tips

    • Multi-layer pipeline: blocklist → rules → ML — cheap checks first.
    • Feature store prevents training-serving skew — a key production ML concept.
    • Imbalanced data: always discuss SMOTE, class weights, and evaluation metric (not accuracy).
    • Feedback loop closes the system — without it, the model decays as fraud patterns evolve.

    FAQ

    What is a feature store and why does it prevent training-serving skew in fraud detection?

    Training-serving skew is the most common production ML failure: the model is trained on features computed one way but served with features computed differently, so it behaves unexpectedly in production. In fraud detection, a model trained on "transactions in last 5 minutes" computed from historical logs must see the exact same feature at serving time, but the production system computes velocity from Redis counters, not logs. If the computation differs even slightly (e.g., different time bucketing), the model receives out-of-distribution features and performs poorly. A feature store solves this by providing a single source of feature definitions shared between training and serving. The offline store (e.g., S3 + Spark) holds historical feature values for training; the online store (e.g., Redis, DynamoDB) serves low-latency feature values for inference. The same feature pipeline code writes to both, so when the model requests "5-minute transaction count for card X," it gets the same value whether training or serving. This is foundational to reliable ML systems: Uber, DoorDash, and Stripe all cite feature stores as critical infrastructure.

    How do you evaluate a fraud detection model and why is accuracy the wrong metric?

    Accuracy is misleading for imbalanced fraud data. If 0.1% of transactions are fraud, a model that always predicts "not fraud" is 99.9% accurate but catches zero fraud. The correct metrics:

    • Precision = TP / (TP + FP): how many flagged transactions are actually fraud. Low precision means many legitimate transactions blocked (false positives, bad UX).
    • Recall = TP / (TP + FN): what fraction of actual fraud was caught. Low recall means fraud slips through.
    • F1 score = 2 × precision × recall / (precision + recall): the harmonic mean, balancing both.

    In practice, use a precision-recall curve and choose the operating point based on business cost. The costs are asymmetric: a false positive (blocking a good transaction) costs roughly $5 in lost revenue and customer churn, while a false negative (missing fraud) costs $100+ in chargebacks and liability. This asymmetry means you should accept somewhat lower precision to achieve high recall. Use AUCPR (area under the precision-recall curve), not AUROC, as the overall model quality metric for imbalanced datasets.

    How do velocity checks catch fraud and how do you implement them at 10K transactions/second?

    Velocity checks flag unusual transaction rates for a card, account, or IP address. Examples: a card used 5 times in 2 minutes (rapid sequential fraud), the same IP making 100 transactions in an hour (bot activity), a card used in 3 countries in 30 minutes (physically impossible travel). Implement with Redis sliding-window counters: each transaction increments a counter per entity and minute bucket (key = vel:{entity_type}:{entity_id}:{minute_bucket}, value = count, via INCR with TTL = 75 minutes). To get 5-minute velocity, sum the last 5 minute buckets, fetched in a single MGET. Lookups for 5 entities (card, account, email, IP, device) across 5 time windows come to 25 Redis operations per transaction; at 10K TPS that is 250K Redis operations/second, which a 3-node Redis cluster with replication handles easily (~500K ops/sec). The resulting velocity feature vector (5 entities × 5 windows = 25 features) is computed in ~2ms and feeds directly into the ML model as input features.
