← Course Index

Search & Indexing

~25 min · Advanced Patterns · DDIA §3 · Alex Xu Vol 2, Ch 1

Ref
Primary Source
Designing Data-Intensive Applications (DDIA) §3 — "Storage and Retrieval" & Alex Xu Vol 2, Ch 1 — "Nearby Friends"

Explains how full-text inverted indexes work under the hood and explores spatial indexing algorithms like Geohash and Quadtrees.

Why Standard DB Queries Fail Search

If you have a database of millions of products or posts, running a relational search like SELECT * FROM posts WHERE content LIKE '%system design%' is highly inefficient. It triggers a full table scan, loading every record from disk to memory to evaluate the string matching, causing high CPU load and latency.

Similarly, for finding nearby drivers (like Uber) or restaurants (like Yelp), running a 2D range query like WHERE lat BETWEEN x1 AND x2 AND lng BETWEEN y1 AND y2 is slow because traditional database indexes are one-dimensional B-Trees.

1. Full-Text Search: The Inverted Index

An **Inverted Index** is a data structure mapping words (terms) to their locations in document files. This is the core technology behind search engines like Elasticsearch and Apache Solr.

Forward Index (Document → Words)
Doc 1: "System design is fun"
Doc 2: "Learn system architecture"
Doc 3: "Fun scaling patterns"

Slow to search: must scan every 
document to find who contains "Learn".
Inverted Index (Word → Document IDs)
"architecture" → [Doc 2]
"design"       → [Doc 1, Doc 2]
"fun"          → [Doc 1, Doc 3]
"learn"        → [Doc 2]
"scaling"      → [Doc 3]
"system"       → [Doc 1, Doc 2]

Search Pipeline:

  1. Analysis: Text is tokenized, lowercased, and filtered (removing common "stop words" like "is", "and").
  2. Stemming: Reducing words to their base form (e.g. "scaling", "scaled" → "scale").
  3. Posting Lists lookup: The engine retrieves matching Doc ID arrays from the index and intersects/merges them.

2. Geospatial Indexing: 2D Spatial Search

To search geospatial coordinate data efficiently, we need a way to index 2D points (latitude, longitude) into a 1D indexable key. Two common methods are Geohashes and Quadtrees.

2D Space Partitioning Root (World) NW NE SW SE SE1 SE2 SE3 SE4
Quadtree: Dividing 2D grid space recursively into 4 quadrants for spatial lookups
Geospatial PatternHow it WorksBest For
Geohash Divides the Earth into a grid, assigning a Base32 string label to each grid. Longer string prefix = higher resolution (e.g. 9q8yy). Easy caching, simple database indexes (finding matching string prefixes), distributed scaling.
Quadtree An in-memory tree structure. A node splits into 4 child quadrants when the number of points (users/cars) in its area exceeds a threshold. Dynamic grids (e.g., highly populated cities have dense node splits, rural areas are represented by a single large node). Good for Uber/Lyft.
Google S2 Maps 2D coordinates to a 1D Hilbert space-filling curve. Highly complex coverage math, polygon intersections (like Uber geofencing).

Designing for Scale

💡 Interview Trade-off: Yelp or Uber

For a Yelp-like system (restaurants are static, locations don't change), Geohashes stored in standard databases are perfect. For an Uber-like system (drivers are constantly moving, dynamic density), an in-memory Quadtree cluster is a much stronger design recommendation.

Check Your Understanding

1. In full-text engines like Elasticsearch, what does an "Inverted Index" map?
2. When designing a driver-dispatch system with millions of moving drivers, why is an in-memory Quadtree better than a static Geohash database index?
3. What is "Stemming" in a search indexing pipeline?