What Is a Stock Exchange System?
A stock exchange matching engine pairs buy and sell orders in real time, executing trades at the best available price. Core requirements: strict price-time priority (FIFO at each price level), sub-millisecond matching latency, full audit trail, and deterministic replay. NYSE and NASDAQ process 8–10 billion shares per day; modern exchanges target single-digit microsecond matching latency.
Order Types
- Market order: execute immediately at best available price
- Limit order: execute only at specified price or better
- Stop order: becomes a market order when a trigger price is reached
- IOC (Immediate or Cancel): fill as much as possible immediately, cancel remainder
- FOK (Fill or Kill): fill entire order immediately or cancel entirely
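The time-in-force variants can be illustrated with a short sketch. Assuming ask levels that already cross the order's limit price, a hypothetical `execute` helper (illustrative name, not a standard API) shows how FOK refuses partial liquidity while IOC takes what it can and cancels the rest:

```python
from dataclasses import dataclass

@dataclass
class Fill:
    qty: int
    price: float

def execute(order_type, want_qty, available):
    """Illustrate IOC vs. FOK against a list of (qty, price) ask levels
    that are assumed to already cross the order's limit price."""
    total = sum(q for q, _ in available)
    if order_type == "FOK" and total < want_qty:
        return []                      # all-or-nothing: cancel entirely
    fills, remaining = [], want_qty
    for qty, price in available:       # best price first
        if remaining == 0:
            break
        take = min(qty, remaining)
        fills.append(Fill(take, price))
        remaining -= take
    # IOC: any unfilled remainder is cancelled, never rested on the book
    return fills
```

With 60 shares available, a 100-share FOK cancels outright, while the same IOC fills 60 and cancels the remaining 40.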
Order Book Data Structure
The order book maintains all outstanding limit orders organized by price level. Two sides:
- Bids: buy orders, sorted by price descending (best bid = highest price buyer will pay)
- Asks: sell orders, sorted by price ascending (best ask = lowest price seller will accept)
Data structure: price-level map using a sorted structure (TreeMap / SortedDict) keyed by price. Each price level holds a queue (FIFO) of orders. Operations:
add_order(order) → O(log P) # P = number of price levels
cancel_order(order_id) → O(1) # with order_id → order pointer map
match_order(order) → O(M log P) # M = number of fills
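A minimal sketch of one book side, using a sorted price list plus per-level FIFO deques in place of a TreeMap/SortedDict; `BookSide`, `add`, `cancel`, and `best` are illustrative names. Cancellation is O(1) via the order_id → order map, with cancelled orders purged lazily:

```python
import bisect
from collections import deque

class BookSide:
    """One side of an order book: sorted price levels, FIFO queue per level."""
    def __init__(self, is_bid):
        self.is_bid = is_bid
        self.prices = []   # sorted ascending
        self.levels = {}   # price -> deque of order dicts (time priority)
        self.by_id = {}    # order_id -> order dict, for O(1) cancel

    def add(self, order_id, price, qty):
        order = {"id": order_id, "price": price, "qty": qty, "live": True}
        if price not in self.levels:
            bisect.insort(self.prices, price)  # O(log P) search (list shift aside)
            self.levels[price] = deque()
        self.levels[price].append(order)       # FIFO within the price level
        self.by_id[order_id] = order

    def cancel(self, order_id):
        self.by_id.pop(order_id)["live"] = False  # O(1); removed lazily in best()

    def best(self):
        """Best bid = highest price; best ask = lowest price."""
        while self.prices:
            price = self.prices[-1] if self.is_bid else self.prices[0]
            queue = self.levels[price]
            while queue and not queue[0]["live"]:
                queue.popleft()                # skip cancelled orders
            if queue:
                return price
            del self.levels[price]             # level empty: drop it
            if self.is_bid:
                self.prices.pop()
            else:
                self.prices.pop(0)
        return None
```

This is a sketch of the interface, not a production structure; a real engine would use a balanced tree or a price-ladder array rather than a Python list.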
Matching Algorithm
Price-time priority (most common): orders at better prices fill first; at the same price, earlier orders fill first.
def match(incoming_buy_order, asks):
    while incoming_buy_order.qty > 0 and asks:
        best_ask_price, best_ask_queue = asks.peekmin()
        if incoming_buy_order.price < best_ask_price:
            break  # best ask is above the buy limit: no cross
        fill against best_ask_queue in FIFO order (oldest resting order first)
        drop filled orders; pop the price level when its queue empties
    if incoming_buy_order.qty > 0:
        add to bids[incoming_buy_order.price]  # rest the unfilled remainder
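The matching loop above can be made runnable with ordinary dicts. This sketch (`match_buy` is a hypothetical helper) finds the best price with `min()` on each pass instead of a tree map, which is fine for illustration; `asks` maps price → deque of `[order_id, qty]` entries:

```python
from collections import deque

def match_buy(price_limit, qty, asks):
    """Greedy price-time-priority match of an incoming buy against the asks.
    Returns the executed fills and the unfilled remainder."""
    fills = []
    while qty > 0 and asks:
        best = min(asks)               # lowest ask price
        if price_limit < best:
            break                      # book no longer crosses
        queue = asks[best]
        order = queue[0]               # oldest order at this level first
        take = min(qty, order[1])
        fills.append((order[0], best, take))
        qty -= take
        order[1] -= take
        if order[1] == 0:
            queue.popleft()            # resting order fully filled
        if not queue:
            del asks[best]             # price level exhausted
    return fills, qty
```

A 120-share buy at 10.1 against 50+50 shares resting at 10.1 fills both resting orders in arrival order and leaves 20 shares unfilled, since the 10.2 level is above the limit.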
Performance Optimizations
For sub-millisecond latency:
- CPU pinning: matching engine thread pinned to a dedicated CPU core — no context switching
- DPDK networking: bypass kernel network stack, process packets directly in userspace
- Lock-free data structures: SPSC (single-producer, single-consumer) ring buffers for order ingestion
- Memory pre-allocation: object pool for Order structs, so the hot path never allocates and never pauses for GC (Java GC pauses are fatal at this scale; engines are typically written in C++ or Rust)
- Cache locality: order book stored in contiguous memory; L1 cache hit vs. heap allocation matters at microsecond latency
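The object-pool idea can be sketched even in Python, which still has a GC, so this only demonstrates the pattern; production engines do it in C++ or Rust with fixed arrays. `Order` and `OrderPool` are illustrative names:

```python
class Order:
    __slots__ = ("order_id", "price", "qty")  # fixed layout, no per-instance dict

class OrderPool:
    """Pre-allocate all Order objects up front; acquire/release recycle them
    so the hot path performs no allocation."""
    def __init__(self, capacity):
        self.free = [Order() for _ in range(capacity)]  # allocate once, at startup

    def acquire(self, order_id, price, qty):
        order = self.free.pop()        # O(1), no allocation on the hot path
        order.order_id, order.price, order.qty = order_id, price, qty
        return order

    def release(self, order):
        self.free.append(order)        # recycle instead of freeing
```

The point is that allocation cost and collector pauses are paid once at startup, never during matching.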
Trade Matching Log and Replay
Every order event (submit, cancel, modify, fill) is appended to an immutable trade log (journal). The journal is the single source of truth. The in-memory order book is derived state — it can be fully reconstructed by replaying the journal from the beginning. Benefits: disaster recovery (rebuild from log), audit compliance (full history), and deterministic simulation (backtest by replaying historical order flow).
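A toy sketch of journal-then-replay, with an in-memory list standing in for an fsync'd append-only file (`Journal` and `apply_event` are hypothetical names). The derived book can be discarded and rebuilt identically from the log:

```python
import json

class Journal:
    """Append-only event log; the in-memory book is derived state."""
    def __init__(self):
        self.entries = []

    def append(self, event):
        self.entries.append(json.dumps(event))  # durable write happens first

    def replay(self, apply):
        for line in self.entries:
            apply(json.loads(line))             # deterministic reconstruction

def apply_event(book, event):
    """Minimal derived state: order_id -> outstanding qty."""
    if event["type"] == "submit":
        book[event["id"]] = event["qty"]
    elif event["type"] == "cancel":
        book.pop(event["id"], None)
```

Because every event is journaled before it is applied, replaying the journal into an empty book reproduces the live book exactly.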
Market Data Distribution
After each match, the exchange broadcasts market data: Level 1 (best bid/ask), Level 2 (full order book depth), and time-and-sales (the trade tape). It is disseminated via multicast UDP to market data subscribers. UDP is used instead of TCP because occasional packet loss is preferable to added latency: receivers detect missed data from gaps in the sequence numbers and request retransmission.
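Receiver-side gap detection can be sketched as a sequence-number tracker (`GapDetector` is an illustrative name; real feeds such as NASDAQ's MoldUDP64 carry per-packet sequence numbers). Sequence numbers are assumed to start at 1 and increase by 1 per message:

```python
class GapDetector:
    """Track feed sequence numbers; record gaps so the consumer can
    request retransmission of the missed range."""
    def __init__(self):
        self.expected = 1
        self.gaps = []

    def on_message(self, seq):
        if seq > self.expected:
            # messages expected..seq-1 were lost on the UDP feed
            self.gaps.append((self.expected, seq - 1))
        if seq >= self.expected:
            self.expected = seq + 1  # ignore duplicates and reordered older seqs
```

Seeing messages 1, 2, 5, 6 yields a single recorded gap (3, 4), which the subscriber would request from the retransmission service.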
Interview Framework
- Order book: TreeMap of price levels, each level is a FIFO queue
- Matching: price-time priority, greedy match against opposite side
- Durability: append-only journal before modifying in-memory book
- Latency: single-threaded matching, CPU pinning, kernel bypass
- Market data: multicast UDP broadcast of level 1/2 and trades
Frequently Asked Questions

Q: How does an order book work, and what data structure implements it efficiently?
A: An order book maintains outstanding limit orders organized by price and time. Two sides: bids (buy orders, sorted descending by price) and asks (sell orders, sorted ascending). Best bid = highest buy price; best ask = lowest sell price. When best bid >= best ask, a trade executes. Implementation: the outer structure is a sorted dictionary (TreeMap in Java, SortedDict in Python) mapping price → queue of orders. Each queue is FIFO, which maintains time priority at the same price. Operations: add_order O(log P) where P = price levels; cancel_order O(1) with an order_id → order pointer hash map; match O(M log P) where M = fills. For ultra-low latency (microseconds), use an array indexed by price (a price ladder) over a bounded price range, giving O(1) access to any level; most stocks trade within a bounded range (e.g., 10,000 price ticks), making array indexing practical. Java TreeMap + LinkedList per level is the standard interview-level answer.

Q: How does the matching engine ensure exactly-once trade execution?
A: The matching engine is single-threaded to avoid races: all orders are processed sequentially on one thread, so no locks are needed and there is no concurrent modification. The sequence: (1) receive the order from the inbound queue, (2) write it to the append-only journal (WAL, write-ahead log) before modifying the order book, (3) apply it to the in-memory book, (4) publish trade executions to the outbound queue. Writing the journal before mutating state is the durability guarantee: if the engine crashes after the journal write but before applying to the book, on restart it replays the journal and re-applies; if it crashes before the journal write, the operation never happened and there is no partial state. The journal is the ground truth. This is the same WAL pattern used by PostgreSQL and Kafka. For distributed exchanges with a hot standby, the primary streams the journal to the standby; on failover, the standby replays any unconfirmed journal entries before going live.

Q: Why do high-frequency trading systems use DPDK and CPU pinning?
A: The standard network path is NIC → kernel interrupt → kernel network stack → kernel buffer → syscall → user space. Each step adds overhead: interrupt latency (microseconds), kernel scheduler latency (context switches), and memory copies between kernel and user space, typically 50–200 microseconds in total for a packet to reach the application. DPDK (Data Plane Development Kit) bypasses the kernel entirely: the NIC DMAs packets directly into user-space ring buffers backed by hugepages, and the application polls the ring buffer in a busy loop with no interrupts and no context switches, cutting latency to 1–5 microseconds. CPU pinning (core isolation) dedicates CPU cores to the matching engine and NIC polling so the OS scheduler never preempts those threads, eliminating scheduler latency (up to 10 ms on a loaded system). Combined with NUMA-aware memory allocation (packets and order book on the same NUMA node as the CPU), these techniques bring end-to-end latency from order receipt to execution under 10 microseconds. Languages: C++ or Rust; Java GC pauses (1–100 ms) are incompatible with microsecond latency targets.