Core Entities
Problem: problem_id, title, slug, difficulty (EASY, MEDIUM, HARD), statement_markdown, constraints, time_limit_ms, memory_limit_mb, tags (array), is_active.
TestCase: tc_id, problem_id, input (text), expected_output (text), is_sample (visible to the user), weight (for partial scoring).
Submission: submission_id, problem_id, user_id, language (PYTHON3, CPP17, JAVA21, GO), code (text), status (PENDING, RUNNING, ACCEPTED, WRONG_ANSWER, TIME_LIMIT_EXCEEDED, MEMORY_LIMIT_EXCEEDED, RUNTIME_ERROR, COMPILATION_ERROR), runtime_ms, memory_mb, submitted_at, judged_at.
Contest: contest_id, title, start_time, end_time, type (ICPC, IOI, VIRTUAL).
ContestSubmission: contest_submission_id, contest_id, submission_id, penalty_minutes, score.
UserStats: user_id, problems_solved, acceptance_rate, ranking, rating.
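The TestCase and Submission entities can be sketched as Python dataclasses; field names follow the schema above, while defaults and types are illustrative assumptions:

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional

class Status(Enum):
    PENDING = "PENDING"
    RUNNING = "RUNNING"
    ACCEPTED = "ACCEPTED"
    WRONG_ANSWER = "WRONG_ANSWER"
    TIME_LIMIT_EXCEEDED = "TIME_LIMIT_EXCEEDED"
    MEMORY_LIMIT_EXCEEDED = "MEMORY_LIMIT_EXCEEDED"
    RUNTIME_ERROR = "RUNTIME_ERROR"
    COMPILATION_ERROR = "COMPILATION_ERROR"

@dataclass
class TestCase:
    tc_id: int
    problem_id: int
    input: str
    expected_output: str
    is_sample: bool = False  # sample cases are visible to the user
    weight: int = 1          # used for IOI-style partial scoring

@dataclass
class Submission:
    submission_id: int
    problem_id: int
    user_id: int
    language: str            # PYTHON3, CPP17, JAVA21, GO
    code: str
    status: Status = Status.PENDING
    runtime_ms: Optional[int] = None   # filled in after judging
    memory_mb: Optional[float] = None
```

In a relational store these map naturally to three tables with TestCase and Submission holding foreign keys to Problem.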
Submission Processing Pipeline
Flow:
(1) User submits code via the web editor.
(2) API validates the input (language supported, code length ≤ 64KB), creates a Submission record with status=PENDING, and returns submission_id immediately; the client polls for results.
(3) API publishes a submission job to a queue (e.g., Kafka or SQS): {submission_id, problem_id, language, code}.
(4) A Judge Worker consumes the job.
(5) The worker fetches the test cases for the problem.
(6) The worker compiles the code (for compiled languages: C++, Java) with a 10-second compilation timeout. If compilation fails: status=COMPILATION_ERROR; store the compiler output.
(7) The worker executes the code against each test case in the sandbox.
(8) After all test cases: status = ACCEPTED (all pass) or WRONG_ANSWER / TIME_LIMIT_EXCEEDED / MEMORY_LIMIT_EXCEEDED / RUNTIME_ERROR.
(9) The worker updates the Submission record and publishes a result event; the client receives the result via WebSocket or polling. Polling fallback: GET /submissions/{id} every 2 seconds. WebSocket: the server pushes when the job completes.
The queue-based architecture decouples the API from computation-heavy judging, and the queue absorbs bursts (a contest start sends thousands of submissions simultaneously).
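The API/worker split can be sketched as follows; a minimal, runnable version where an in-memory queue.Queue and a dict stand in for the real message broker and database (those names and the run_tests callback are illustrative assumptions):

```python
import json
import queue

# Stand-ins for the real infrastructure: in production the queue would be
# Kafka/SQS and `submissions` would be a database table.
jobs = queue.Queue()
submissions = {}  # submission_id -> record

def submit(submission_id, problem_id, language, code):
    """API side: validate, persist as PENDING, enqueue, return the id."""
    if len(code.encode()) > 64 * 1024:
        raise ValueError("code exceeds 64KB limit")
    submissions[submission_id] = {"status": "PENDING"}
    jobs.put(json.dumps({
        "submission_id": submission_id,
        "problem_id": problem_id,
        "language": language,
        "code": code,
    }))
    return submission_id  # client polls /submissions/{id} for the verdict

def judge_worker(run_tests):
    """Worker side: consume one job, judge it, update the record."""
    job = json.loads(jobs.get())
    submissions[job["submission_id"]]["status"] = "RUNNING"
    verdict = run_tests(job)  # compile + run against all test cases
    submissions[job["submission_id"]]["status"] = verdict

submit(1, 42, "PYTHON3", "print(input())")
judge_worker(lambda job: "ACCEPTED")  # judging stubbed out for the sketch
```

The API call returns as soon as the job is enqueued, which is what keeps request latency flat during a contest-start burst.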
Sandboxed Execution
Submitted code is untrusted: it can attempt to access the filesystem or network, fork-bomb, or consume unbounded memory. Sandbox requirements:
(1) CPU time limit enforcement (SIGKILL after time_limit_ms).
(2) Memory limit enforcement (cgroup memory controller or setrlimit).
(3) No filesystem write access (except /tmp with a size limit).
(4) No network access (network namespace isolation).
(5) No forking beyond a small limit (prevents fork bombs).
(6) No root privileges (run as an unprivileged user).
Implementation options: Linux namespaces + cgroups (isolate the process tree, CPU, memory, network, and filesystem). Tools: isolate (used by competitive programming judges), nsjail (Google's sandbox, popular for CTFs), gVisor (a user-space application kernel that intercepts syscalls). Docker containers: each submission runs in a fresh container with resource limits; slower startup (~200ms) but simpler to manage. Language-specific sandboxes: for Python, RestrictedPython or compilation to a restricted subset; for C++, compile natively and run in a process sandbox.
Standard approach: pre-warm a pool of sandbox containers per language. Assign a container from the pool, run the submission, reset the container, and return it to the pool. This avoids cold-start overhead.
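The setrlimit portion of these controls can be sketched in Python; this is a Linux-only illustration of resource limits alone, not a full sandbox (namespace and cgroup isolation for network/filesystem, as provided by isolate or nsjail, is out of scope here, and the limit values are arbitrary):

```python
import resource
import subprocess
import sys

def run_sandboxed(cmd, stdin_text, time_limit_s=2, memory_limit_mb=512):
    """Run a command with CPU, address-space, and process-count limits.
    setrlimit alone cannot block network or filesystem access; real judges
    add namespaces/cgroups on top of this."""
    def set_limits():
        # Hard CPU limit: the kernel kills the process when exceeded.
        resource.setrlimit(resource.RLIMIT_CPU, (time_limit_s, time_limit_s + 1))
        # Address-space cap approximates a memory limit.
        limit = memory_limit_mb * 1024 * 1024
        resource.setrlimit(resource.RLIMIT_AS, (limit, limit))
        # Small process limit to stop fork bombs.
        resource.setrlimit(resource.RLIMIT_NPROC, (16, 16))

    proc = subprocess.run(
        cmd, input=stdin_text, capture_output=True, text=True,
        preexec_fn=set_limits,          # applied in the child before exec
        timeout=time_limit_s * 2,       # wall-clock backstop
    )
    return proc.returncode, proc.stdout

rc, out = run_sandboxed([sys.executable, "-c", "print(int(input()) * 2)"], "21\n")
```

Note that RLIMIT_CPU counts CPU time, not wall-clock time, so a sleeping process needs the separate wall-clock timeout shown above.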
Test Case Execution and Comparison
Execute the submission against each test case independently.
Input/output: pipe the test case input to stdin; capture stdout.
Comparison: exact string match (trim trailing whitespace and newlines).
Special judge (checker): some problems require custom output validation (e.g., any valid topological sort is acceptable). A checker program takes (input, expected_output, actual_output) and returns ACCEPTED or WRONG_ANSWER. The checker itself is trusted code (provided by the problem setter).
Time measurement: measure wall-clock time and CPU time separately. Use CPU time for limit enforcement (wall-clock time is affected by system load).
Memory measurement: peak RSS (Resident Set Size) via /proc/{pid}/status or cgroup memory accounting.
Partial scoring (IOI-style): each test case has a weight. Score = sum of weights of passed test cases. ACCEPTED = 100% of test cases passed.
For educational purposes: show which sample test cases failed (do not reveal hidden test case inputs/outputs, to prevent cheating).
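The comparison rules can be sketched as two small functions; the normalization policy (strip trailing whitespace per line, ignore trailing blank lines) and the token-based epsilon checker are one common convention, not the only one:

```python
def outputs_match(expected: str, actual: str) -> bool:
    """Exact match after normalizing: trailing whitespace is stripped from
    each line and trailing blank lines are ignored."""
    def normalize(text: str):
        return [line.rstrip() for line in text.rstrip("\n").split("\n")]
    return normalize(expected) == normalize(actual)

def floats_match(expected: str, actual: str, eps: float = 1e-6) -> bool:
    """Special-judge sketch for floating-point answers: compare outputs
    token by token, accepting values within an absolute epsilon."""
    e, a = expected.split(), actual.split()
    return len(e) == len(a) and all(
        abs(float(x) - float(y)) <= eps for x, y in zip(e, a)
    )
```

A topological-sort checker would follow the same shape but verify the user's output against the original input graph rather than against the reference output.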
Leaderboard and Rating System
Contest leaderboard: rank by problems solved DESC, then by total penalty time ASC.
Penalty: for each accepted problem, time_of_acceptance + 20 minutes per wrong attempt before acceptance.
Real-time leaderboard: updated on each submission. Store in a Redis sorted set: ZADD leaderboard:{contest_id} {score} {user_id}. Score encoding: encode (problems_solved, -penalty) as a single float for ZRANGEBYSCORE, or maintain a separate sorted structure.
Rating system: Elo-based (like Codeforces). Expected performance is derived from the participant's current rating relative to the other contestants; actual performance from their rank. Rating delta = K * (actual - expected), where K varies by experience level (larger for new participants, smaller for established ones). Ratings are stored in UserStats and updated after each contest ends (batch job).
Submission history: store all submissions permanently for audit, appeals, and user profile display.
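The penalty rule, the sorted-set score encoding, and the rating update can be sketched as follows; the 1e6 penalty bound in the encoding and the scores-in-[0,1] rating convention are simplifying assumptions for illustration:

```python
def icpc_penalty(accept_minute: int, wrong_attempts: int) -> int:
    """Penalty for one accepted problem: acceptance time (minutes from
    contest start) plus 20 minutes per prior wrong attempt.
    Unsolved problems contribute nothing."""
    return accept_minute + 20 * wrong_attempts

def leaderboard_score(problems_solved: int, total_penalty: int) -> float:
    """Encode (solved DESC, penalty ASC) as one float suitable for a Redis
    sorted set; assumes total penalty stays below 1e6 minutes so solved
    count always dominates."""
    return problems_solved * 1e6 - total_penalty

def rating_delta(expected_score: float, actual_score: float, k: float = 32) -> float:
    """Elo-style update: delta = K * (actual - expected), with scores in
    [0, 1] (e.g., fraction of participants outperformed). Real systems
    such as Codeforces use a more involved expected-rank model."""
    return k * (actual_score - expected_score)
```

With this encoding a single ZADD per accepted submission keeps the leaderboard ordered, and ZREVRANGE reads it back in rank order.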
{"@context":"https://schema.org","@type":"FAQPage","mainEntity":[{"@type":"Question","name":"How do you prevent malicious code from harming the judge server?","acceptedAnswer":{"@type":"Answer","text":"Run each submission in a Linux sandbox using namespaces and cgroups: network namespace (no internet access), mount namespace (read-only filesystem except /tmp), PID namespace (isolated process tree), memory cgroup (enforce memory limit), CPU time via SIGKILL after the time limit. Tools like isolate or nsjail implement these controls. Pre-warmed container pools avoid cold-start latency while maintaining isolation."}},{"@type":"Question","name":"Why use a queue for submission processing instead of synchronous execution?","acceptedAnswer":{"@type":"Answer","text":"Code execution is CPU-intensive and takes up to several seconds per submission. Processing synchronously in the API server would block the request thread for seconds and make the API unresponsive under load. A queue decouples the API (fast, returns immediately with a submission ID) from the judges (slow, execute independently). The queue also absorbs burst traffic: at a contest start, thousands of submissions arrive simultaneously and are processed at the judges' own pace."}},{"@type":"Question","name":"What is a special judge (checker) and when is it needed?","acceptedAnswer":{"@type":"Answer","text":"A special judge is used when multiple correct outputs are possible and exact string comparison would reject valid answers. Examples: any valid topological sort, any shortest path (when multiple exist), floating-point answers within epsilon, problems where output order doesn't matter. The checker is a trusted program provided by the problem setter that takes (input, expected_output, user_output) and returns ACCEPTED or WRONG_ANSWER with a message."}},{"@type":"Question","name":"How do you measure CPU time and memory usage for submitted code?","acceptedAnswer":{"@type":"Answer","text":"CPU time: read /proc/{pid}/stat (utime + stime fields) or use getrusage() after the process exits. This gives CPU time consumed, not wall-clock time affected by system load. Memory: read peak RSS from /proc/{pid}/status (VmPeak or VmRSS) or from cgroup memory.max_usage_in_bytes. The sandbox enforces hard limits via setrlimit (RLIMIT_CPU, RLIMIT_AS) or cgroup limits, killing the process if exceeded."}},{"@type":"Question","name":"How does an ICPC-style penalty scoring system work?","acceptedAnswer":{"@type":"Answer","text":"In ICPC scoring: teams are ranked by number of problems solved (descending), then by total penalty time (ascending). For each accepted problem, penalty = time of first acceptance (in minutes from contest start) + 20 minutes * number of wrong attempts before acceptance. Wrong attempts on unsolved problems incur no penalty. This rewards both speed and accuracy: excessive wrong guessing is penalized even if the problem is eventually solved."}}]}