Covers building a scalable metrics collection pipeline, time-series data modeling, storage engine choice, and alerting rules.
In a large distributed system, components fail constantly. Monitoring helps answer "Is the system working?" Observability helps answer "Why is it broken?" Observability is grounded in the Three Pillars (metrics, logs, and traces):
- Metrics: Numeric values measured over time.
Low memory and storage overhead; excellent for real-time alerting.
Examples: CPU load, QPS, error rates, JVM memory.
- Logs: Text events with timestamps.
Detailed but expensive to index.
- Traces: End-to-end paths of requests
as they traverse microservices.
Identifies downstream bottlenecks.
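To measure availability objectively, Google site reliability engineering (SRE) practices define three levels: service level indicators (SLIs), service level objectives (SLOs), and service level agreements (SLAs).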
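To make "measuring availability objectively" concrete, here is a minimal arithmetic sketch. It assumes availability is measured as the share of successful requests and that the target is a percentage over a fixed window. The function names and the 99.9% / 30-day figures are illustrative and do not come from the text above.

```python
def measured_availability(successful: int, total: int) -> float:
    """Availability as the percentage of requests that succeeded."""
    if total == 0:
        raise ValueError("no requests observed")
    return 100 * successful / total


def allowed_downtime_minutes(target_percent: float, window_days: int = 30) -> float:
    """Minutes of full downtime a target tolerates over the window."""
    window_minutes = window_days * 24 * 60
    return window_minutes * (1 - target_percent / 100)


def meets_target(successful: int, total: int, target_percent: float) -> bool:
    """True when the measured availability is at or above the target."""
    return measured_availability(successful, total) >= target_percent
```

Under these assumptions, a 99.9% target over 30 days leaves roughly 43 minutes of permitted downtime, which is why each additional "nine" is much harder to deliver.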
Time-series databases (TSDBs) such as Prometheus or InfluxDB store metrics. They are optimized for fast appends of numeric samples and for aggregate queries (e.g., computing the average CPU load over the last 10 minutes).
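The append-and-aggregate access pattern can be sketched with a toy in-memory store. This is not how Prometheus or InfluxDB are implemented; the class `TinyTSDB` and its methods are hypothetical, and a series is assumed to be identified by a metric name plus a set of labels.

```python
import bisect
from collections import defaultdict


class TinyTSDB:
    """Toy store: one append-only (timestamps, values) pair per series."""

    def __init__(self):
        self._series = defaultdict(lambda: ([], []))

    @staticmethod
    def _key(name, labels):
        # Sort labels so the same label set always maps to the same series.
        return (name, tuple(sorted(labels.items())))

    def append(self, name, labels, timestamp, value):
        timestamps, values = self._series[self._key(name, labels)]
        if timestamps and timestamp < timestamps[-1]:
            raise ValueError("out-of-order sample")
        timestamps.append(timestamp)
        values.append(value)

    def avg_over_time(self, name, labels, start, end):
        """Average of samples with start <= timestamp <= end, or None."""
        timestamps, values = self._series.get(self._key(name, labels), ([], []))
        lo = bisect.bisect_left(timestamps, start)
        hi = bisect.bisect_right(timestamps, end)
        window = values[lo:hi]
        return sum(window) / len(window) if window else None
```

Because samples arrive in time order, appends are cheap and a time-window query reduces to two binary searches plus a scan of the matching slice.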
Pull model: the metrics collector issues HTTP GET
requests to app endpoints (/metrics) at
configured intervals.
✓ Simple: targets don't need configuration.
✓ Server controls scraping rate.
✗ Harder to monitor ephemeral or serverless
jobs (requires a Pushgateway intermediary).
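The scrape-over-HTTP approach described above might look like the following sketch. It uses only the Python standard library; the handler, the `name value` line format, and the counter `http_requests_total` are simplified assumptions for illustration, not the full exposition format of any particular collector.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


class MetricsHandler(BaseHTTPRequestHandler):
    """Hypothetical app endpoint that exposes counters as 'name value' lines."""

    counters = {"http_requests_total": 42.0}

    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = "".join(f"{k} {v}\n" for k, v in self.counters.items()).encode()
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # silence per-request logging in this demo


def start_demo_target():
    """Start the demo endpoint on a free local port in a background thread."""
    server = HTTPServer(("127.0.0.1", 0), MetricsHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def scrape(host, port, path="/metrics", timeout=2.0):
    """Collector side: one HTTP GET, parsed into {metric_name: value}."""
    conn = http.client.HTTPConnection(host, port, timeout=timeout)
    try:
        conn.request("GET", path)
        resp = conn.getresponse()
        if resp.status != 200:
            raise RuntimeError(f"scrape failed with HTTP {resp.status}")
        text = resp.read().decode()
    finally:
        conn.close()
    samples = {}
    for line in text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        name, value = line.rsplit(" ", 1)
        samples[name] = float(value)
    return samples
```

A real collector would call something like `scrape` for each target on a timer, which is how the server side keeps control of the scraping rate.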
Push model: app instances actively send TCP/UDP packets
to a central metrics collector daemon.
✓ Ideal for short-lived cron jobs or lambda
functions.
✗ Scale issue: microservice spikes can
accidentally DDoS the metrics collector.
✗ Complex agent installation on every host.
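The push-style flow described above can be sketched with a single UDP datagram. The payload uses the StatsD-style `name:value|c` counter line as an assumed wire format; the function names and the metric `jobs_completed` are hypothetical.

```python
import socket


def open_collector(host="127.0.0.1", port=0):
    """Collector side: bind a UDP socket (port 0 picks any free port)."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind((host, port))
    sock.settimeout(2.0)
    return sock


def push_counter(name, value, collector_addr):
    """App side: fire-and-forget one counter increment to the collector."""
    payload = f"{name}:{value}|c".encode()
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(payload, collector_addr)


def receive_one(sock):
    """Collector side: read one datagram and split it into its parts."""
    data, _ = sock.recvfrom(1024)
    name, rest = data.decode().split(":", 1)
    value, metric_type = rest.split("|", 1)
    return name, float(value), metric_type
```

Since the sender decides when to transmit, a short-lived job can report once before it exits. For the same reason, a burst of senders can overwhelm the collector unless it buffers or sheds load.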