Search & Indexing

~25 min · Advanced Patterns · DDIA §3 · Alex Xu Vol 2, Ch 1

Ref

Primary Source

Designing Data-Intensive Applications (DDIA) §3 — "Storage and Retrieval" & Alex Xu Vol 2, Ch 1 — "Nearby Friends"

Explains how full-text inverted indexes work under the hood and explores spatial indexing algorithms like Geohash and Quadtrees.

Why Standard DB Queries Fail Search

If you have a database of millions of products or posts, running a relational search like SELECT * FROM posts WHERE content LIKE '%system design%' is highly inefficient. It triggers a full table scan, loading every record from disk to memory to evaluate the string matching, causing high CPU load and latency.

Similarly, for finding nearby drivers (like Uber) or restaurants (like Yelp), running a 2D range query like WHERE lat BETWEEN x1 AND x2 AND lng BETWEEN y1 AND y2 is slow because traditional database indexes are one-dimensional B-Trees.

1. Full-Text Search: The Inverted Index

An **Inverted Index** is a data structure mapping words (terms) to their locations in document files. This is the core technology behind search engines like Elasticsearch and Apache Solr.

Forward Index (Document → Words)

Doc 1: "System design is fun"
Doc 2: "Learn system architecture"
Doc 3: "Fun scaling patterns"

Slow to search: must scan every 
document to find who contains "Learn".

Inverted Index (Word → Document IDs)

"architecture" → [Doc 2]
"design"       → [Doc 1, Doc 2]
"fun"          → [Doc 1, Doc 3]
"learn"        → [Doc 2]
"scaling"      → [Doc 3]
"system"       → [Doc 1, Doc 2]

Search Pipeline:

Analysis: Text is tokenized, lowercased, and filtered (removing common "stop words" like "is", "and").
Stemming: Reducing words to their base form (e.g. "scaling", "scaled" → "scale").
Posting Lists lookup: The engine retrieves matching Doc ID arrays from the index and intersects/merges them.

2. Geospatial Indexing: 2D Spatial Search

To search geospatial coordinate data efficiently, we need a way to index 2D points (latitude, longitude) into a 1D indexable key. Two common methods are Geohashes and Quadtrees.

Quadtree: Dividing 2D grid space recursively into 4 quadrants for spatial lookups

Geospatial Pattern	How it Works	Best For
Geohash	Divides the Earth into a grid, assigning a Base32 string label to each grid. Longer string prefix = higher resolution (e.g. `9q8yy`).	Easy caching, simple database indexes (finding matching string prefixes), distributed scaling.
Quadtree	An in-memory tree structure. A node splits into 4 child quadrants when the number of points (users/cars) in its area exceeds a threshold.	Dynamic grids (e.g., highly populated cities have dense node splits, rural areas are represented by a single large node). Good for Uber/Lyft.
Google S2	Maps 2D coordinates to a 1D Hilbert space-filling curve.	Highly complex coverage math, polygon intersections (like Uber geofencing).

Designing for Scale

Search Clusters (Sharding): Large search systems partition document indexes across multiple instances. Typically sharded by Document ID (term query needs fan-out to all shards) or sharded by Term (which creates bottlenecks for hot search terms).
Dynamic Updates: Inverted indexes are highly optimized for reads and immutable on disk (SSTables/segments). Rather than rebuilding index files on every write, search databases write updates to an in-memory buffer (MemTable) and merge changes in the background (Log-Structured Merge / Segment Merging).

💡 Interview Trade-off: Yelp or Uber

For a Yelp-like system (restaurants are static, locations don't change), Geohashes stored in standard databases are perfect. For an Uber-like system (drivers are constantly moving, dynamic density), an in-memory Quadtree cluster is a much stronger design recommendation.

Check Your Understanding

1. In full-text engines like Elasticsearch, what does an "Inverted Index" map?

2. When designing a driver-dispatch system with millions of moving drivers, why is an in-memory Quadtree better than a static Geohash database index?

3. What is "Stemming" in a search indexing pipeline?