Monitoring & Observability

~20 min · Advanced Patterns · Alex Xu Vol 2, Ch 5 · DDIA §1

Primary Source

Alex Xu Vol 2, Chapter 5 — "Metrics Monitoring and Alerting System"

Covers building a scalable metrics collection pipeline, time-series data modeling, storage engine choice, and alerting rules.

What is Observability?

In a large distributed system, components fail constantly. Monitoring answers the question "Is the system working?" Observability answers the question "Why is it broken?" Observability rests on three pillars: metrics, logs, and traces.

1. Metrics (What is happening)

Numeric values measured over time.
- Low memory/storage overhead.
- Excellent for real-time alerting.
- Examples: CPU load, QPS, error rates, JVM memory.

2. Logs & Traces (Where / Why)

- Logs: Timestamped text events. Detailed but expensive to index.
- Traces: End-to-end paths of requests as they traverse microservices. Useful for identifying downstream bottlenecks.
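
To make the idea of finding a downstream bottleneck from a trace concrete, here is a minimal sketch. The `Span` type, the service names, and the helper functions are hypothetical, and the sketch assumes each span's children run one after another (no overlap), so a span's own time is its duration minus the time spent in its direct children.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Span:
    """One hop of a request (hypothetical, simplified shape)."""
    span_id: str
    parent_id: Optional[str]
    service: str
    start_ms: float
    end_ms: float


def self_time_by_span(spans: List[Span]) -> Dict[str, float]:
    """Duration of each span minus the time spent in its direct children."""
    child_time: Dict[str, float] = {}
    for s in spans:
        if s.parent_id is not None:
            duration = s.end_ms - s.start_ms
            child_time[s.parent_id] = child_time.get(s.parent_id, 0.0) + duration
    return {
        s.span_id: (s.end_ms - s.start_ms) - child_time.get(s.span_id, 0.0)
        for s in spans
    }


def bottleneck(spans: List[Span]) -> Span:
    """Return the span that spent the most time doing its own work."""
    self_times = self_time_by_span(spans)
    return max(spans, key=lambda s: self_times[s.span_id])


# Illustrative trace: gateway -> orders -> inventory-db (made-up numbers).
trace = [
    Span("a", None, "gateway", 0.0, 120.0),
    Span("b", "a", "orders", 10.0, 110.0),
    Span("c", "b", "inventory-db", 20.0, 100.0),
]
slowest = bottleneck(trace)
print(slowest.service)
```

In this made-up trace the gateway and the orders service each spend 20 ms on their own work, while the database call accounts for 80 ms, so the database span is reported as the bottleneck.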

SLI vs SLO vs SLA

To measure reliability objectively, Google's site reliability engineering (SRE) practice defines three related terms:

SLI (Service Level Indicator): A quantitative measure of some aspect of service performance.
Example: "The latency of successful HTTP GET requests."
SLO (Service Level Objective): The target value or range that an SLI should meet.
Example: "99.9% of HTTP GET requests must have a latency < 200 ms."
SLA (Service Level Agreement): The legal contract with customers outlining what happens if SLOs are missed (refunds, penalties).
Example: "If availability falls below 99.9%, we credit back 10% of billings."
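
The relationship between the three terms can be sketched in a few lines. This is an illustrative sketch only: the `Request` type and the function names are hypothetical, the thresholds are taken from the examples above, and treating an empty sample as "objective met" is an assumption.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Request:
    method: str
    status: int
    latency_ms: float


def latency_sli(requests: List[Request], threshold_ms: float = 200.0) -> float:
    """SLI: fraction of successful GET requests faster than threshold_ms."""
    eligible = [r for r in requests if r.method == "GET" and 200 <= r.status < 300]
    if not eligible:
        return 1.0  # assumption: no traffic counts as meeting the objective
    good = sum(1 for r in eligible if r.latency_ms < threshold_ms)
    return good / len(eligible)


def meets_slo(sli: float, target: float = 0.999) -> bool:
    """SLO: the SLI must reach the target (99.9% in the example above)."""
    return sli >= target


def sla_credit(availability: float, monthly_bill: float,
               threshold: float = 0.999, credit_rate: float = 0.10) -> float:
    """SLA: credit owed to the customer when availability misses the threshold."""
    return monthly_bill * credit_rate if availability < threshold else 0.0


# 998 fast requests and 2 slow ones: the SLI is 0.998, below the 0.999 target.
sample = [Request("GET", 200, 50.0)] * 998 + [Request("GET", 200, 350.0)] * 2
sli = latency_sli(sample)
print(sli, meets_slo(sli))
```

The SLI is just a measurement, the SLO compares that measurement with an internal target, and the SLA attaches a contractual consequence to missing a threshold.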

Metrics Pipeline Architecture

A time-series database (TSDB) such as Prometheus or InfluxDB stores the metrics. TSDBs are optimized for fast appends of numeric samples and for aggregate queries (e.g., computing the average CPU over the last 10 minutes).
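
As a rough illustration of the access pattern a TSDB is built around (append-only writes, then aggregation over a time window), here is a toy in-memory sketch. `TinyTSDB` is hypothetical and is not how Prometheus or InfluxDB store data; it only assumes that samples arrive in timestamp order.

```python
from bisect import bisect_left, bisect_right
from collections import defaultdict
from typing import Optional


class TinyTSDB:
    """Toy in-memory store: one sorted list of (timestamp, value) per series."""

    def __init__(self) -> None:
        self._timestamps = defaultdict(list)
        self._values = defaultdict(list)

    def append(self, series: str, timestamp: float, value: float) -> None:
        """Append a sample; samples must arrive in timestamp order."""
        ts = self._timestamps[series]
        if ts and timestamp < ts[-1]:
            raise ValueError("out-of-order sample")
        ts.append(timestamp)
        self._values[series].append(value)

    def avg_over(self, series: str, now: float, window_s: float) -> Optional[float]:
        """Average of samples with now - window_s <= timestamp <= now."""
        ts = self._timestamps.get(series, [])
        vals = self._values.get(series, [])
        start = bisect_left(ts, now - window_s)   # binary search for window start
        end = bisect_right(ts, now)               # and for window end
        window = vals[start:end]
        return sum(window) / len(window) if window else None


# One CPU sample per minute for 20 minutes (made-up values 0, 1, ..., 20).
db = TinyTSDB()
for minute in range(21):
    db.append("cpu_load", minute * 60, float(minute))

print(db.avg_over("cpu_load", now=1200, window_s=600))
```

Because timestamps only ever grow, writes are plain appends and a window query needs just two binary searches, which is the property the paragraph above describes.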

Metrics collection pipeline: a scraper periodically queries target nodes and saves the values in a time-series database.

Pull vs Push Models

Pull Model (Prometheus)

The metrics collector issues HTTP GET requests to app endpoints (/metrics) at configured intervals.

✓ Simple: targets only expose an endpoint and need no knowledge of the collector.
✓ Server controls scraping rate.
✗ Harder to monitor ephemeral or serverless jobs (requires a Pushgateway as an intermediary).
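
A single pull cycle can be sketched with the Python standard library alone. Everything here is illustrative: `APP_METRICS`, `MetricsHandler`, `scrape`, and `demo_scrape_once` are hypothetical names, the metric values are made up, and the body is a simplified `name value` line format loosely modeled on the Prometheus text format rather than a full implementation of it.

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer
from typing import Dict

# Hypothetical in-process metrics that the app exposes.
APP_METRICS = {"http_requests_total": 42.0, "cpu_load": 0.75}


class MetricsHandler(BaseHTTPRequestHandler):
    """The target side: serves current metric values at /metrics."""

    def do_GET(self) -> None:
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = "".join(f"{name} {value}\n" for name, value in APP_METRICS.items())
        payload = body.encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)

    def log_message(self, *args) -> None:
        pass  # keep the demo quiet


def scrape(url: str) -> Dict[str, float]:
    """The collector side: one HTTP GET, parsed into {metric_name: value}."""
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))  # no proxy
    with opener.open(url, timeout=2) as resp:
        text = resp.read().decode("utf-8")
    samples = {}
    for line in text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        name, value = line.rsplit(" ", 1)
        samples[name] = float(value)
    return samples


def demo_scrape_once() -> Dict[str, float]:
    """Start a target on a free local port, scrape it once, shut it down."""
    server = HTTPServer(("127.0.0.1", 0), MetricsHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        port = server.server_address[1]
        return scrape(f"http://127.0.0.1:{port}/metrics")
    finally:
        server.shutdown()
        server.server_close()


if __name__ == "__main__":
    print(demo_scrape_once())
```

A real collector would call `scrape` in a loop at the configured interval and write each sample, with a timestamp, to the TSDB. The target never needs to know where the collector lives, which is the simplicity advantage listed above.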

Push Model (Datadog, StatsD)

App instances actively send TCP/UDP packets to a central metrics collector daemon.

✓ Ideal for short-lived cron jobs or lambda functions.
✗ Scale issue: a traffic spike across many microservices can overwhelm the metrics collector, in effect an accidental DDoS.
✗ Complex agent installation on every host.
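
For comparison, a push can be sketched as a single UDP datagram in the StatsD-style `name:value|type` line format (`c` for a counter). The helper names and the metric name `cron.job.runs` are hypothetical, and the "collector" here is just a local socket standing in for a real daemon.

```python
import socket
from typing import Tuple


def statsd_line(name: str, value: float, metric_type: str) -> bytes:
    """Encode one sample as a StatsD-style line, e.g. b'cron.job.runs:1|c'."""
    return f"{name}:{value}|{metric_type}".encode("ascii")


def push(sock: socket.socket, addr: Tuple[str, int],
         name: str, value: float, metric_type: str = "c") -> None:
    """Fire-and-forget: the app sends the sample and does not wait for a reply."""
    sock.sendto(statsd_line(name, value, metric_type), addr)


def demo_push_once() -> str:
    """Send one counter increment to a stand-in collector on a free local port."""
    collector = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    collector.bind(("127.0.0.1", 0))
    collector.settimeout(2)
    app = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        push(app, collector.getsockname(), "cron.job.runs", 1, "c")
        data, _sender = collector.recvfrom(1024)
        return data.decode("ascii")
    finally:
        app.close()
        collector.close()


if __name__ == "__main__":
    print(demo_push_once())
```

Because the app initiates the send, a job that lives for only a few seconds can still report before it exits. The flip side is that the collector has no control over the incoming rate, which is where the scale issue above comes from.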

Check Your Understanding

1. What is a key characteristic of metrics that distinguishes them from logs?

2. You are writing a legal document guaranteeing a client that your database availability will not drop below 99.99%. What document is this?

3. What database engine type is optimized specifically for writing and querying measurements like CPU load or API counts over time?