System Design: Online Judge — Code Execution, Sandboxing, Test Cases, and Scalable Evaluation

What Is an Online Judge?

An online judge (LeetCode, HackerRank, Codeforces) accepts code submissions in multiple languages, executes them against test cases, and returns results: Accepted, Wrong Answer, Time Limit Exceeded, Memory Limit Exceeded, Runtime Error, or Compilation Error. The core challenges: safe execution of untrusted code (sandboxing), scalability under submission spikes, and accurate verdict computation.
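The verdict set above can be captured as a simple enum; a minimal sketch (names and values are illustrative, not any platform's actual API):

```python
from enum import Enum

class Verdict(Enum):
    # The standard verdict set an online judge returns for a submission.
    ACCEPTED = "Accepted"
    WRONG_ANSWER = "Wrong Answer"
    TIME_LIMIT_EXCEEDED = "Time Limit Exceeded"
    MEMORY_LIMIT_EXCEEDED = "Memory Limit Exceeded"
    RUNTIME_ERROR = "Runtime Error"
    COMPILATION_ERROR = "Compilation Error"
```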

Code Execution and Sandboxing

Untrusted user code can fork bomb the system, read sensitive files, make network calls, or consume unbounded memory. Isolation layers:

  • Container isolation: run each submission in a Docker container with no network access, a read-only filesystem, and resource limits (CPU: 1 core, memory: 256MB).
  • Seccomp (Secure Computing Mode): whitelist only the system calls needed for computation (read, write, exit); block fork, exec, socket, open.
  • Namespace isolation: separate PID, network, and mount namespaces.
  • Time limit enforcement: use SIGALRM or a cgroups CPU quota to kill the process after the time limit (e.g., 2 seconds).
  • Memory limit: cgroups memory.limit_in_bytes kills the process on OOM.

Multiple layers provide defense in depth — even if the container is compromised, seccomp prevents dangerous syscalls.
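One of these layers — per-process CPU and memory caps — can be sketched with POSIX rlimits. This is only a sketch of that single layer (Linux/Unix only); a real judge stacks containers, seccomp, and namespaces on top. The function name and parameters are illustrative:

```python
import resource
import subprocess
import sys

def run_with_limits(cmd, cpu_seconds=2, memory_bytes=256 * 1024 * 1024):
    """Run `cmd` with rlimit-based CPU and memory caps applied in the
    child process before exec. Returns (returncode, stdout)."""
    def apply_limits():
        # Kernel sends SIGXCPU at the soft CPU limit, SIGKILL at the hard one.
        resource.setrlimit(resource.RLIMIT_CPU, (cpu_seconds, cpu_seconds + 1))
        # Cap the address space as a rough out-of-memory guard.
        resource.setrlimit(resource.RLIMIT_AS, (memory_bytes, memory_bytes))

    completed = subprocess.run(
        cmd,
        capture_output=True,
        timeout=cpu_seconds + 2,   # wall-clock backstop (e.g., for sleep loops)
        preexec_fn=apply_limits,   # runs in the forked child, before exec
    )
    return completed.returncode, completed.stdout
```

A well-behaved program exits normally; a busy loop is killed by the kernel once it exhausts its CPU quota, which the worker maps to Time Limit Exceeded.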

Execution Pipeline

Submission flow: user submits code -> API server validates the request (language supported, code length within limits) -> enqueue to a message queue (Kafka or SQS) per language -> judge worker picks up the job -> worker spins up a Docker container -> compiles (if a compiled language) -> runs against each test case -> collects results -> sends the verdict back via a result queue -> API server stores the result and updates the user’s submission history.
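The worker side of this pipeline reduces to a consume-execute-publish loop. A minimal sketch using in-process queues as stand-ins for Kafka/SQS; `execute` is a hypothetical callback covering the container spin-up, compile, and run steps:

```python
import queue

def judge_worker(jobs: "queue.Queue", results: "queue.Queue", execute) -> None:
    """Consume submission jobs, run them, and publish verdicts.

    `execute(language, code, problem_id)` stands in for the real
    container + compile + run pipeline and returns a verdict string.
    """
    while True:
        job = jobs.get()
        if job is None:          # poison pill: shut the worker down
            break
        verdict = execute(job["language"], job["code"], job["problem_id"])
        results.put({"submission_id": job["submission_id"], "verdict": verdict})
        jobs.task_done()
```

Because the worker keeps no state between jobs, any worker can handle any submission — the property that makes horizontal scaling work.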

Test cases: each problem has N test cases (typically 50-200). Run all test cases and return the first failure (or Accepted if all pass). For efficiency: run a few lightweight test cases first (fast feedback). Run heavy test cases last. Test cases are stored in object storage (S3) — workers download them at job start. Cache popular problem test cases in worker local storage.
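The run-until-first-failure loop, with lightweight cases ordered first, can be sketched as follows. `run_one` is a hypothetical callback that executes the submission against one input and reports a status plus captured output; the `weight` field ordering the cases is also an assumption:

```python
def judge_submission(run_one, test_cases):
    """Run test cases lightweight-first; return the first failing verdict,
    or "Accepted" if every case passes.

    `run_one(input_data)` -> (status, output), where status is "OK" or a
    failure verdict such as "Time Limit Exceeded" or "Runtime Error".
    """
    ordered = sorted(test_cases, key=lambda tc: tc.get("weight", 0))
    for tc in ordered:
        status, output = run_one(tc["input"])
        if status != "OK":
            return status                       # TLE / MLE / Runtime Error
        # Trailing-whitespace-insensitive comparison against expected output.
        if output.rstrip() != tc["expected"].rstrip():
            return "Wrong Answer"
    return "Accepted"
```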

Language Support

Interpreted languages (Python, JavaScript): compile step is skipped. Execute directly. Compiled languages (C++, Java, Go): compile first, report Compilation Error if it fails, then run the binary. Per-language containers: each language has a dedicated base Docker image with the compiler/runtime pre-installed (warm start). Container pooling: pre-warm N containers per language to avoid cold start overhead on each submission. Return containers to the pool after execution (reset the filesystem). Language-specific time limit adjustments: Python is 3x slower than C++ for the same algorithm — set per-language time limits (C++ 1s, Python 3s).
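Per-language behavior is naturally driven by a configuration table. A sketch under the assumptions above (image names, commands, and multipliers are illustrative, with C++ as the 1.0 baseline):

```python
# Hypothetical per-language config: base image, compile/run commands,
# and a time-limit multiplier relative to the C++ baseline.
LANGUAGES = {
    "cpp":    {"image": "judge-cpp",    "compile": ["g++", "-O2", "main.cpp", "-o", "main"],
               "run": ["./main"],            "time_factor": 1.0},
    "python": {"image": "judge-python", "compile": None,   # interpreted: no compile step
               "run": ["python3", "main.py"], "time_factor": 3.0},
    "java":   {"image": "judge-java",   "compile": ["javac", "Main.java"],
               "run": ["java", "Main"],      "time_factor": 2.0},
}

def effective_time_limit(language: str, base_limit_s: float) -> float:
    """Scale the problem's base limit by the language factor
    (e.g., a 1s C++ limit becomes 3s for Python)."""
    return base_limit_s * LANGUAGES[language]["time_factor"]

def needs_compile(language: str) -> bool:
    """Interpreted languages have no compile command, so skip that stage."""
    return LANGUAGES[language]["compile"] is not None
```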

Scalability

Contest mode: thousands of simultaneous submissions (start of a contest). Scale judge workers horizontally: auto-scale the worker pool based on queue depth. Separate queues per language — prevents a Python submission spike from delaying C++ submissions. Priority queue: submissions for paid users or during contests get higher priority. Judge worker isolation: each worker can only run one submission at a time (CPU-bound) — over-scheduling degrades performance for all. Typical sizing: 1 worker core = 10 submissions/minute. For 1000 submissions/minute: 100 worker cores minimum. Use spot instances for judge workers (70% cheaper, acceptable eviction rate with job re-enqueue).
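The sizing rule above (one worker core ≈ 10 submissions/minute) is just a ceiling division, which an autoscaler can evaluate against observed queue inflow. A back-of-envelope sketch:

```python
import math

def workers_needed(submissions_per_minute: float,
                   per_core_throughput: float = 10.0) -> int:
    """Minimum worker cores for a given submission rate, assuming one
    worker core handles ~10 submissions/minute (from the text)."""
    if submissions_per_minute <= 0:
        return 0
    return math.ceil(submissions_per_minute / per_core_throughput)
```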

Result Delivery

Async results: submissions are processed asynchronously. The frontend polls or uses WebSocket to receive the verdict when ready. Client-side: show “Judging…” with progress updates. Server push: when the verdict is ready, push via WebSocket to the client’s browser. Store all submissions in a database: (submission_id, user_id, problem_id, language, code, verdict, runtime_ms, memory_mb, submitted_at). User can view their submission history and replay any submission.
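The submission record listed above maps directly onto a dataclass; a sketch with illustrative types (field names mirror the listed columns):

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional

@dataclass
class Submission:
    """One row of the submissions table; verdict fields stay None
    while the client is still shown "Judging..."."""
    submission_id: int
    user_id: int
    problem_id: int
    language: str
    code: str
    verdict: Optional[str] = None
    runtime_ms: Optional[int] = None
    memory_mb: Optional[float] = None
    submitted_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )
```

When the result queue delivers a verdict, the API server fills in `verdict`, `runtime_ms`, and `memory_mb`, then pushes the update over WebSocket.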

Interview Tips

  • Sandboxing is the core design challenge. Mention at least two isolation layers (Docker + seccomp) — single-layer isolation is insufficient for truly untrusted code.
  • The job queue + worker pool pattern is standard. Emphasize that workers are stateless (any worker can handle any submission) — this enables horizontal scaling.
  • Test case management: test cases are the intellectual property of the platform. Store encrypted in S3; workers decrypt locally. Do not transmit test case outputs to clients (prevents reverse-engineering).

Asked at: Meta, Coinbase, Databricks, Cloudflare