
Monitoring & Observability

~20 min · Advanced Patterns · Alex Xu Vol 2, Ch 5 · DDIA §1

Primary Source
Alex Xu Vol 2, Chapter 5 — "Metrics Monitoring and Alerting System"

Covers building a scalable metrics collection pipeline, time-series data modeling, storage engine choice, and alerting rules.

What is Observability?

In a large distributed system, components fail constantly. Monitoring answers the question "Is the system working?" Observability answers "Why is it broken?" Observability rests on three pillars: metrics, logs, and traces, grouped below by the question each helps answer:

1. Metrics (What is happening)
Numeric values measured over time.
- Low memory/storage overhead.
- Excellent for real-time alerting.
- Examples: CPU load, QPS, error rates, JVM memory.

2. Logs & Traces (Where / Why)
- Logs: Text events with timestamps. Detailed but expensive to index.
- Traces: End-to-end paths of requests as they traverse microservices. Identifies downstream bottlenecks.

SLI vs SLO vs SLA

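To measure availability objectively, Google's site reliability engineering (SRE) practices define three levels:

- SLI (Service Level Indicator): a measured value of service behavior, such as the fraction of requests that succeed.
- SLO (Service Level Objective): an internal target for an SLI, such as 99.9% of requests succeeding over 30 days.
- SLA (Service Level Agreement): a contract with customers that promises a service level and specifies the consequences if it is missed.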
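
To make an availability target concrete, it helps to convert it into a downtime budget. The sketch below is illustrative only: the function names `downtime_budget` and `availability_sli`, the 30-day window, and the request counts are assumptions chosen for the example, not values from the source chapter.

```python
from datetime import timedelta


def downtime_budget(target: float, window: timedelta) -> timedelta:
    """Maximum downtime allowed by an availability target over a window."""
    if not 0.0 < target <= 1.0:
        raise ValueError("target must be a fraction in (0, 1]")
    return window * (1.0 - target)


def availability_sli(good_requests: int, total_requests: int) -> float:
    """Measured availability: the fraction of requests that succeeded."""
    if total_requests == 0:
        return 1.0  # no traffic means nothing failed
    return good_requests / total_requests


# Hypothetical target: 99.99% availability over a 30-day window.
budget = downtime_budget(0.9999, timedelta(days=30))
print(f"allowed downtime: {budget.total_seconds() / 60:.2f} minutes")

# Hypothetical measurement: 50 failed requests out of one million.
measured = availability_sli(999_950, 1_000_000)
print("target met" if measured >= 0.9999 else "target missed")
```

Under these assumptions, 99.99% over 30 days leaves roughly 4.3 minutes of downtime, which is why each additional "nine" is costly to guarantee.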

Metrics Pipeline Architecture

A time-series database (TSDB) such as Prometheus or InfluxDB stores the metrics. TSDBs are optimized for fast appends of numeric samples and for aggregate queries over time ranges (e.g. computing the average CPU load for the last 10 minutes).
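
The access pattern a TSDB optimizes for can be sketched with a toy in-memory store: appends go to the end of a time-ordered list, and a range aggregate locates the window with binary search. This is a simplified illustration, not how Prometheus or InfluxDB are implemented; the class name `TinySeriesStore` and the series name are hypothetical.

```python
from bisect import bisect_left, bisect_right
from collections import defaultdict
from typing import Optional


class TinySeriesStore:
    """Toy store: per series, parallel append-only lists of timestamps and values."""

    def __init__(self) -> None:
        self._timestamps = defaultdict(list)
        self._values = defaultdict(list)

    def append(self, series: str, ts: float, value: float) -> None:
        """Append a sample; samples must arrive in timestamp order."""
        ts_list = self._timestamps[series]
        if ts_list and ts < ts_list[-1]:
            raise ValueError("out-of-order sample")
        ts_list.append(ts)
        self._values[series].append(value)

    def avg_over(self, series: str, now: float, window_s: float) -> Optional[float]:
        """Average of samples with now - window_s <= ts <= now, or None if empty."""
        ts_list = self._timestamps.get(series, [])
        start = bisect_left(ts_list, now - window_s)
        end = bisect_right(ts_list, now)
        window = self._values.get(series, [])[start:end]
        return sum(window) / len(window) if window else None


store = TinySeriesStore()
# Hypothetical CPU samples taken once a minute (timestamps in seconds).
for minute, load in enumerate([0.20, 0.40, 0.60, 0.80]):
    store.append("cpu_load", minute * 60.0, load)

# Average over the last 2 minutes, evaluated at t = 180 s.
print(store.avg_over("cpu_load", now=180.0, window_s=120.0))
```

Because samples only ever arrive at the tail, writes are cheap, and the sorted timestamps keep range queries from scanning the whole series.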

[Figure: App Server 1, App Server 2, and App Server 3 are scraped by a Collector / Scraper ("Pull metrics"), which writes ("Write") to the TSDB (Prometheus); Grafana (UI) and Alert Manager are also shown.]
Metrics collection pipeline: the scraper periodically queries the target nodes and saves the values in a time-series database.

Pull vs Push Models

Pull Model (Prometheus)
The metrics collector initiates HTTP GET requests to app endpoints (/metrics) at configured intervals.

✓ Simple: targets only expose an endpoint and don't need to know where the collector is.
✓ Server controls scraping rate.
✗ Harder to monitor ephemeral serverless jobs (requires a Pushgateway middleman).
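
A minimal sketch of the pull model using only the Python standard library: the app exposes a `/metrics` endpoint, and a collector fetches it with an HTTP GET. The metric name `http_requests_total`, its value, and the helpers `scrape` and `demo_pull` are assumptions for illustration; a real setup would use a Prometheus client library and the full exposition format rather than this one-line response.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

REQUEST_COUNT = 42  # hypothetical in-process counter the app exposes


class MetricsHandler(BaseHTTPRequestHandler):
    """App side: serve current metric values as plain text on /metrics."""

    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = f"http_requests_total {REQUEST_COUNT}\n".encode()
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # silence per-request logging in this demo


def scrape(host: str, port: int) -> dict:
    """Collector side: GET /metrics and parse 'name value' lines."""
    conn = http.client.HTTPConnection(host, port, timeout=2)
    try:
        conn.request("GET", "/metrics")
        text = conn.getresponse().read().decode()
    finally:
        conn.close()
    samples = {}
    for line in text.splitlines():
        if line and not line.startswith("#"):
            name, value = line.rsplit(" ", 1)
            samples[name] = float(value)
    return samples


def demo_pull() -> dict:
    """Start a throwaway app server on a free port and scrape it once."""
    server = HTTPServer(("127.0.0.1", 0), MetricsHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        return scrape("127.0.0.1", server.server_address[1])
    finally:
        server.shutdown()
        server.server_close()


if __name__ == "__main__":
    print(demo_pull())
```

In a real collector, `scrape` would run on a timer for every configured target, which is what lets the server control the scraping rate.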

Push Model (Datadog, StatsD)
App instances actively send TCP/UDP packets to a central metrics collector daemon.

✓ Ideal for short-lived cron jobs or lambda functions.
✗ Scale issue: microservice spikes can accidentally DDoS the metrics collector.
✗ Complex agent installation on every host.
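
A minimal sketch of the push model: the app formats a counter increment in the StatsD line format (`<name>:<value>|c`) and sends it as a single UDP datagram. The metric name `checkout.orders` and the helpers `format_counter`, `push_metric`, and `demo_push` are hypothetical; the local socket below only stands in for a collector daemon so the example runs on its own.

```python
import socket


def format_counter(name: str, value: int = 1) -> bytes:
    """Encode a counter increment in the StatsD line format: <name>:<value>|c"""
    return f"{name}:{value}|c".encode()


def push_metric(payload: bytes, host: str, port: int) -> None:
    """Fire-and-forget: send one UDP datagram to the collector."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(payload, (host, port))


def demo_push() -> bytes:
    """Bind a stand-in collector on a free local port, push once, return what it got."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as collector:
        collector.bind(("127.0.0.1", 0))
        collector.settimeout(2)
        host, port = collector.getsockname()
        push_metric(format_counter("checkout.orders"), host, port)
        data, _addr = collector.recvfrom(1024)
        return data


if __name__ == "__main__":
    print(demo_push())
```

UDP keeps the sender cheap because it never waits for a reply, but nothing limits how fast many instances can send, which is the scale issue listed above.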

Check Your Understanding

1. What is a key characteristic of metrics that distinguishes them from logs?
2. You are writing a legal document guaranteeing a client that your database availability will not drop below 99.99%. What document is this?
3. What database engine type is optimized specifically for writing and querying measurements like CPU load or API counts over time?