Geospatial Data Fundamentals
Geospatial queries are common in production systems: “find the 10 nearest restaurants,” “is this user inside a geofence?”, “show all events within 50km.” Without a spatial index, a standard database answers proximity queries with a full table scan, which is O(n) at minimum. Geospatial indexes reduce these queries to O(log n) or O(k), where k is the result count.

Key concepts:
- Haversine distance: great-circle distance between two lat/lng points on Earth’s surface. For short distances, the equirectangular approximation is usually sufficient: distance ≈ sqrt((dlat * 111km)^2 + (dlng * cos(lat) * 111km)^2).
- Bounding box: a rectangle of lat/lng coordinates used to pre-filter candidates before exact distance calculation.
- Geohash: encodes a lat/lng into a string of characters where a common prefix implies geographic proximity.
- S2 geometry: Google’s library that divides Earth into hierarchical cells for efficient spatial operations.
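The haversine formula and the short-range approximation above can be sketched as follows (a minimal illustration; the coordinates are two points in midtown Manhattan, chosen for the example):

```python
import math

EARTH_RADIUS_KM = 6371.0

def haversine_km(lat1, lng1, lat2, lng2):
    """Great-circle distance between two lat/lng points, in km."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lng2 - lng1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

def approx_km(lat1, lng1, lat2, lng2):
    """Equirectangular approximation: accurate enough for short distances."""
    dlat = (lat2 - lat1) * 111.0  # ~111 km per degree of latitude
    dlng = (lng2 - lng1) * 111.0 * math.cos(math.radians(lat1))
    return math.sqrt(dlat ** 2 + dlng ** 2)

# Two nearby points in Manhattan: both methods agree to within a few meters
print(haversine_km(40.7484, -73.9857, 40.7580, -73.9855))  # ≈ 1.07 km
print(approx_km(40.7484, -73.9857, 40.7580, -73.9855))
```

The approximation avoids the trig-heavy haversine computation, which matters when scoring thousands of candidates per query; switch to full haversine once distances exceed a few hundred kilometers.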
Proximity Search: PostGIS and Redis
-- PostGIS: find restaurants within 5km of a point
CREATE TABLE restaurants (
    id BIGSERIAL PRIMARY KEY,
    name TEXT,
    location GEOGRAPHY(POINT, 4326) -- WGS84 geography point (note: lng, lat order)
);
CREATE INDEX idx_restaurants_location ON restaurants USING GIST(location);

-- Query: find all restaurants within 5km, sorted by distance
SELECT id, name,
       ST_Distance(location,
                   ST_MakePoint(-73.9857, 40.7484)::GEOGRAPHY) AS dist_meters
FROM restaurants
WHERE ST_DWithin(
          location,
          ST_MakePoint(-73.9857, 40.7484)::GEOGRAPHY,
          5000 -- meters; ST_DWithin uses the GIST index
      )
ORDER BY dist_meters
LIMIT 20;
# Redis geospatial: fast proximity search (redis-py 4.x)
# Add restaurant locations -- GEOADD takes (longitude, latitude, member)
redis.geoadd("restaurants", (longitude, latitude, restaurant_id))

# Find restaurants within 5km, sorted by distance
# (GEORADIUS is deprecated since Redis 6.2; GEOSEARCH is its replacement)
results = redis.georadius(
    "restaurants",
    longitude=-73.9857,
    latitude=40.7484,
    radius=5,
    unit="km",
    sort="ASC",
    count=20,
    withcoord=True,
    withdist=True,
)
# Returns: [[id, dist_km, (lng, lat)], ...]
Geohash-Based Sharding
Geohash encodes a lat/lng into a string: each additional character subdivides the cell, alternately halving longitude and latitude precision. A shared geohash prefix implies a shared geographic region.
- Precision: length 4 ≈ 40km x 20km, length 6 ≈ 1.2km x 0.6km, length 8 ≈ 38m x 19m.
- Sharding by geohash: store entities in shards based on their geohash prefix.
- Proximity query: compute the geohash of the query point, find all neighboring geohash cells (up to 8 neighbors), and query those shards.
- Limitation: geohash cells at high precision are irregular rectangles (not circles), and neighboring cells do not always share a prefix.
- Edge case: locations near a geohash boundary can have completely different prefixes despite being physically close; always check the 8 neighbors, not just the query point’s own cell.
S2 (Google) mitigates these boundary artifacts with a hierarchical decomposition of the sphere into quadrilateral cells projected from a cube, giving more uniform cell geometry than geohash; Uber’s H3 takes a similar approach with hexagonal cells. Both still require checking neighboring cells near boundaries.
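The standard interleaved-bit geohash encoding, and the boundary edge case above, can be demonstrated with a minimal encoder (a production system would use a library such as python-geohash; this sketch is for illustration only):

```python
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"  # geohash alphabet (no a, i, l, o)

def geohash_encode(lat, lng, precision=6):
    """Encode lat/lng by alternately bisecting longitude and latitude."""
    lat_lo, lat_hi = -90.0, 90.0
    lng_lo, lng_hi = -180.0, 180.0
    chars, bits, ch, even = [], 0, 0, True  # even-numbered bits encode longitude
    while len(chars) < precision:
        if even:
            mid = (lng_lo + lng_hi) / 2
            if lng >= mid:
                ch, lng_lo = ch * 2 + 1, mid
            else:
                ch, lng_hi = ch * 2, mid
        else:
            mid = (lat_lo + lat_hi) / 2
            if lat >= mid:
                ch, lat_lo = ch * 2 + 1, mid
            else:
                ch, lat_hi = ch * 2, mid
        even = not even
        bits += 1
        if bits == 5:  # 5 bits per base32 character
            chars.append(BASE32[ch])
            bits, ch = 0, 0
    return "".join(chars)

# Nearby points in midtown Manhattan share a long prefix...
print(geohash_encode(40.7484, -73.9857))
print(geohash_encode(40.7580, -73.9855))
# ...but two points ~22m apart straddling the equator share no prefix at all,
# which is exactly why proximity queries must also check neighboring cells:
print(geohash_encode(0.0001, 0.0001))
print(geohash_encode(-0.0001, 0.0001))
```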
Geofencing at Scale
Geofencing: detect when a moving entity (user, vehicle) enters or exits a defined polygon.
- Naive approach: for each location update, run a point-in-polygon test (ray casting) against every polygon. O(k * n), where k = location updates/sec and n = total geofences.
- Optimized: bounding box pre-filter. Each geofence stores its bounding box (min_lat, max_lat, min_lng, max_lng), indexed in a 2D spatial index (R-tree in PostGIS, or geohash-based). On each location update: find all geofences whose bounding box contains the point (fast index lookup), then run the exact point-in-polygon test only on those candidates (typically < 10 vs. thousands).
- State tracking: maintain the last known inside/outside status per (entity, geofence) pair to fire ENTER/EXIT events correctly. Store this state in Redis (expiring after inactivity).
- Kafka-based pipeline: location updates → Kafka → geofence consumer → state comparison → ENTER/EXIT events → downstream automation.
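The pipeline above can be sketched end to end: bounding-box pre-filter, ray-casting point-in-polygon, and per-(entity, fence) state for ENTER/EXIT events. In-memory dicts stand in for the spatial index and the Redis state store; polygons are lists of (lat, lng) vertices, and the fence name is illustrative:

```python
def point_in_polygon(lat, lng, polygon):
    """Ray casting: cast a ray east from the point, count edge crossings."""
    inside = False
    for i in range(len(polygon)):
        lat1, lng1 = polygon[i]
        lat2, lng2 = polygon[(i + 1) % len(polygon)]
        if (lat1 > lat) != (lat2 > lat):  # edge spans the point's latitude
            cross_lng = lng1 + (lat - lat1) * (lng2 - lng1) / (lat2 - lat1)
            if lng < cross_lng:
                inside = not inside
    return inside

def bounding_box(polygon):
    lats, lngs = [p[0] for p in polygon], [p[1] for p in polygon]
    return min(lats), max(lats), min(lngs), max(lngs)

state = {}  # (entity_id, fence_id) -> last known inside status (Redis in prod)

def process_update(entity_id, lat, lng, fences):
    """fences: fence_id -> (polygon, bbox). Returns ENTER/EXIT events."""
    # Fast pre-filter: fences whose bounding box contains the point...
    candidates = {
        fid for fid, (_, (lo_lat, hi_lat, lo_lng, hi_lng)) in fences.items()
        if lo_lat <= lat <= hi_lat and lo_lng <= lng <= hi_lng
    }
    # ...plus fences the entity was last inside, so EXIT still fires
    # when the point leaves the bounding box entirely.
    candidates |= {fid for (eid, fid), inside in state.items()
                   if eid == entity_id and inside}
    events = []
    for fid in candidates:
        now_inside = point_in_polygon(lat, lng, fences[fid][0])
        was_inside = state.get((entity_id, fid), False)
        if now_inside != was_inside:
            events.append(("ENTER" if now_inside else "EXIT", fid))
        state[(entity_id, fid)] = now_inside
    return events

midtown = [(40.74, -74.00), (40.76, -74.00), (40.76, -73.98), (40.74, -73.98)]
fences = {"midtown": (midtown, bounding_box(midtown))}
print(process_update("u1", 40.75, -73.99, fences))  # fires ENTER
print(process_update("u1", 40.70, -73.99, fences))  # fires EXIT
```

Note the second candidate set: without re-checking fences the entity was last inside, a point that jumps out of the bounding box would never generate its EXIT event.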
Location History and Privacy
Storing fine-grained location history (every GPS update) creates privacy risks and storage costs.
- Storage architecture: raw location stream → Kafka → location consumer, which (1) updates the current location in Redis (TTL = 30s, for real-time features), (2) downsamples to 1 update per minute and writes to a time-series table (TimescaleDB or InfluxDB), and (3) summarizes to “visited places” (significant location stays, anonymized) for long-term retention.
- Data retention: raw 1-minute samples for 7 days; summarized visit data for 1 year; aggregated/anonymized analytics indefinitely.
- GDPR compliance: users can request deletion of all location data. Implement a “right to erasure” pipeline that deletes raw samples, summarized data, and cached locations within 72 hours.
- Location data is PII: encrypt at rest and restrict access by role.
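The consumer’s first two steps can be sketched with in-memory dicts standing in for Redis and the time-series store; update_location() is a hypothetical handler invoked once per GPS sample arriving from Kafka:

```python
import time

DOWNSAMPLE_SECS = 60

current_location = {}  # user_id -> (lat, lng, ts); Redis with TTL=30s in prod
last_written = {}      # user_id -> timestamp of last time-series write
timeseries = []        # downsampled rows; TimescaleDB/InfluxDB in prod

def update_location(user_id, lat, lng, ts=None):
    ts = time.time() if ts is None else ts
    # (1) always refresh the real-time current location
    current_location[user_id] = (lat, lng, ts)
    # (2) write at most one sample per minute to the time-series store
    if user_id not in last_written or ts - last_written[user_id] >= DOWNSAMPLE_SECS:
        timeseries.append((user_id, lat, lng, ts))
        last_written[user_id] = ts

# Five samples 20s apart -> only two time-series rows (t=0 and t=60),
# while the current location always reflects the latest sample
for t in (0, 20, 40, 60, 80):
    update_location("u1", 40.7484, -73.9857, ts=t)
```

Keeping the real-time path (step 1) unconditional while gating only the durable write (step 2) means live features never see stale data, while storage grows at a bounded 1 row/user/minute.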
See also: Uber Interview Prep
See also: DoorDash Interview Prep
See also: Lyft Interview Prep