System Design: Geospatial Platform — Proximity Search, Geofencing, and Location Data at Scale (2025)

Geospatial Data Fundamentals

Geospatial queries are common in production systems: “find the 10 nearest restaurants,” “is this user inside a geofence?”, “show all events within 50km.” Without a spatial index, a proximity query degenerates to a full table scan, O(n) per query. Geospatial indexes bring this down to O(log n) or O(k), where k is the result count.

Key concepts:

- Haversine distance: the great-circle distance between two lat/lng points on Earth’s surface. For short distances, the cheaper equirectangular approximation suffices: distance ≈ sqrt((dlat * 111km)^2 + (dlng * cos(lat) * 111km)^2), where 111km is roughly the length of one degree of latitude.
- Bounding box: a rectangle of lat/lng coordinates used to pre-filter candidates before the exact distance calculation.
- Geohash: encodes a lat/lng into a string of characters where a shared prefix implies geographic proximity.
- S2 geometry: Google’s library that divides Earth into hierarchical cells for efficient spatial operations.
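A minimal sketch of both distance calculations in Python (standard library only); the coordinates in the usage example are the Empire State Building and Times Square:

import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius
KM_PER_DEG_LAT = 111.0    # approximate length of one degree of latitude

def haversine_km(lat1, lng1, lat2, lng2):
    """Exact great-circle distance between two points, in kilometers."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lng2 - lng1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

def equirectangular_km(lat1, lng1, lat2, lng2):
    """Fast approximation; accurate enough for short, non-polar distances."""
    dlat = (lat2 - lat1) * KM_PER_DEG_LAT
    dlng = (lng2 - lng1) * KM_PER_DEG_LAT * math.cos(math.radians((lat1 + lat2) / 2))
    return math.sqrt(dlat * dlat + dlng * dlng)

print(haversine_km(40.7484, -73.9857, 40.7580, -73.9855))        # ~1.07 km
print(equirectangular_km(40.7484, -73.9857, 40.7580, -73.9855))  # ~1.07 km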

Proximity Search: PostGIS and Redis

-- PostGIS: find restaurants within 5km of a point
CREATE TABLE restaurants (
    id BIGSERIAL PRIMARY KEY,
    name TEXT,
    location GEOGRAPHY(POINT, 4326)  -- WGS 84 point; PostGIS expects (lng, lat) order
);
CREATE INDEX idx_restaurants_location ON restaurants USING GIST(location);

-- Query: find all restaurants within 5km, sorted by distance
SELECT id, name,
    ST_Distance(location,
                ST_MakePoint(-73.9857, 40.7484)::GEOGRAPHY) AS dist_meters
FROM restaurants
WHERE ST_DWithin(
    location,
    ST_MakePoint(-73.9857, 40.7484)::GEOGRAPHY,
    5000  -- meters
)
ORDER BY dist_meters
LIMIT 20;

# Redis geospatial: fast proximity search (redis-py)
# Add a restaurant location; redis-py 4+ expects (lng, lat, member) triples
redis.geoadd("restaurants", (longitude, latitude, restaurant_id))

# Find restaurants within 5km, sorted by distance.
# Note: GEORADIUS is deprecated since Redis 6.2 in favor of GEOSEARCH,
# but remains available and widely used.
results = redis.georadius(
    "restaurants",
    longitude=-73.9857,
    latitude=40.7484,
    radius=5,
    unit="km",
    sort="ASC",
    count=20,
    withcoord=True,
    withdist=True,
)
# Returns: [[member_id, dist_km, (lng, lat)], ...]

Geohash-Based Sharding

Geohash encodes a lat/lng into a string in which each additional character subdivides the current bounding box, so a shared prefix implies a shared geographic region. Precision: length 4 ≈ 40km x 20km, length 6 ≈ 1.2km x 0.6km, length 8 ≈ 38m x 19m.

Sharding by geohash: store entities in shards keyed by their geohash prefix. A proximity query computes the geohash of the query point, finds all neighboring geohash cells (up to 8 neighbors), and queries those shards, as in the sketch below.

Limitation: geohash cells at high precision are irregular rectangles (not circles), and neighboring cells do not always share a prefix. Edge case: locations near a geohash boundary can be physically close yet have very different prefixes; always check the 8 neighbors, not just the query point’s own cell. Google’s S2 library mitigates these boundary artifacts with a hierarchical decomposition of the sphere into quadrilateral cells (projected from the six faces of a cube), which yields more uniform cell sizes and simpler neighbor and covering computations than rectangular geohash cells.
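A sketch of the neighbor-expansion query in Python. It assumes the python-geohash package (geohash.encode and geohash.neighbors); query_shard and the candidate records’ lat/lng attributes are hypothetical stand-ins for whatever storage layer owns each prefix:

import geohash  # pip install python-geohash

# Choose the precision so one cell is at least as wide as the search radius;
# otherwise the 3x3 block of cells will not cover the full circle.
SHARD_PRECISION = 5  # length-5 cells are roughly 4.9km x 4.9km

def proximity_query(lat, lng, radius_km, query_shard):
    """Collect candidates from the query cell plus its 8 neighbors,
    then post-filter with an exact distance check.

    query_shard(prefix) is a hypothetical helper returning all entities
    stored in the shard that owns that geohash prefix.
    """
    center = geohash.encode(lat, lng, precision=SHARD_PRECISION)
    cells = [center] + geohash.neighbors(center)  # 9 cells total
    candidates = []
    for cell in cells:
        candidates.extend(query_shard(cell))
    # Cells are rectangles, not circles, so an exact distance filter is
    # still required; haversine_km is the function from the earlier sketch.
    return [c for c in candidates
            if haversine_km(lat, lng, c.lat, c.lng) <= radius_km]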

Geofencing at Scale

Geofencing: detect when a moving entity (a user, a vehicle) enters or exits a defined polygon.

Naive approach: on each location update, run a point-in-polygon test (ray casting) against every polygon. This is O(k * n), where k = location updates/sec and n = total geofences.

Optimized approach:

- Bounding box pre-filter: each geofence stores its bounding box (min_lat, max_lat, min_lng, max_lng). Index the bounding boxes in a 2D spatial index (an R-tree in PostGIS, or geohash-based).
- On each location update, find all geofences whose bounding box contains the point (a fast index lookup), then run the exact point-in-polygon test only on those candidates (typically fewer than 10, versus thousands).
- State tracking: maintain the last known inside/outside status per (entity, geofence) pair so that ENTER/EXIT events fire only on transitions. Store this state in Redis, expiring it after inactivity.
- Kafka-based pipeline: location updates → Kafka → geofence consumer → state comparison → ENTER/EXIT events → downstream automation. The consumer’s core logic is sketched below.
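A minimal sketch of the consumer’s core logic, assuming Shapely 2.x (its STRtree serves as the in-process R-tree). The fence data, the in-memory state dict (Redis in production), and the on_event callback are illustrative:

from shapely.geometry import Point, Polygon
from shapely.strtree import STRtree

# Geofence polygons, vertices in (lng, lat) order to match PostGIS convention
fences = {
    "midtown": Polygon([(-74.00, 40.74), (-73.97, 40.74),
                        (-73.97, 40.76), (-74.00, 40.76)]),
}
fence_ids = list(fences)
tree = STRtree([fences[f] for f in fence_ids])  # R-tree over bounding boxes

state = {}  # entity_id -> set of fence_ids the entity is currently inside

def on_location_update(entity_id, lat, lng, on_event):
    point = Point(lng, lat)
    # 1. Bounding-box pre-filter: Shapely 2.x query() returns candidate indices
    candidate_idx = tree.query(point)
    # 2. Exact point-in-polygon test on the few surviving candidates
    now_inside = {fence_ids[i] for i in candidate_idx
                  if fences[fence_ids[i]].contains(point)}
    # 3. Compare against the last known state to fire transition events
    was_inside = state.get(entity_id, set())
    for fence_id in now_inside - was_inside:
        on_event(entity_id, fence_id, "ENTER")
    for fence_id in was_inside - now_inside:
        on_event(entity_id, fence_id, "EXIT")
    state[entity_id] = now_inside

on_location_update("user-1", 40.7484, -73.9857, print)  # user-1 midtown ENTER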

Location History and Privacy

Storing fine-grained location history (every GPS update) creates both privacy risk and storage cost.

Storage architecture: raw location stream → Kafka → location consumer, which (1) updates the entity’s current location in Redis (TTL = 30s, for real-time features), (2) downsamples to 1 update per minute and writes to a time-series table (TimescaleDB or InfluxDB), and (3) summarizes the data into “visited places” (significant location stays, anonymized) for long-term retention. A sketch of this consumer follows.

Data retention: raw 1-minute samples for 7 days; summarized visit data for 1 year; aggregated, anonymized analytics indefinitely.

GDPR compliance: users can request deletion of all their location data. Implement a “right to erasure” pipeline that removes raw samples, summarized data, and cached locations within 72 hours. Location data is PII: encrypt it at rest and restrict access by role.
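A sketch of the consumer’s hot path, assuming kafka-python and redis-py; the topic name, message schema, and write_timeseries sink are illustrative:

import json

import redis
from kafka import KafkaConsumer  # pip install kafka-python

r = redis.Redis()
consumer = KafkaConsumer(
    "location-updates",                        # illustrative topic name
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda m: json.loads(m),
)

last_write = {}  # entity_id -> unix timestamp of last time-series write

for msg in consumer:
    update = msg.value  # e.g. {"entity_id": ..., "lat": ..., "lng": ..., "ts": ...}
    entity_id = update["entity_id"]

    # (1) Current location in Redis; the 30s TTL clears stale entries on its own
    r.set(f"loc:{entity_id}", json.dumps(update), ex=30)

    # (2) Downsample: at most one time-series row per entity per minute
    if update["ts"] - last_write.get(entity_id, 0) >= 60:
        write_timeseries(update)  # hypothetical sink for TimescaleDB / InfluxDB
        last_write[entity_id] = update["ts"]

    # (3) Visit summarization (stay-point detection) runs as a separate
    #     batch or streaming job over the downsampled table.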

See also: Uber Interview Prep

See also: DoorDash Interview Prep

See also: Lyft Interview Prep
