Metrics & Monitoring System
An Experienced Engineer’s Walkthrough for Backend Engineers
Metrics and monitoring give you observability: you need to know if the system is healthy, where it is slow, and when to act. When you’re new, it’s easy to add a few counters or logs and stop there; in production you need a clear model of what to measure (e.g. latency, traffic, errors, saturation), how to store and query it (e.g. time-series DB), and how to turn that into alerts and SLOs so the team knows when to react. The usual building blocks are metrics (counters, gauges, histograms), logs, and traces. This article focuses on metrics: types (counter, gauge, histogram), how they are collected (pull vs push) and stored, how alerting works, and how SLI/SLO and error budget tie to reliability. Stacks like Prometheus + Grafana are common; the concepts apply broadly. No prior experience with Prometheus or SLOs is assumed.
Lesson 1: What Are We Measuring?
Before choosing tools, we need to know what to measure. As a newcomer, it’s easy to add a few random counters (“total requests,” “cache hits”) and then struggle to answer “is the system healthy?” or “why was it slow last night?” The golden signals give a simple framework that works for most services:
- Latency: How long do requests take? You care about percentiles (P50, P99, P999), not just average, because a few slow requests can dominate user experience. Use histograms or summaries.
- Traffic: How many requests per second (or per minute)? Use counters, then rate() (or equivalent) to get "requests per second." This tells you load and trends.
- Errors: How many 5xx responses, timeouts, or business errors? Use counters by status code or error type. Error rate (errors / total requests) is what you'll often put in an SLO.
- Saturation: How full are resources? Queue depth, CPU usage, memory usage. Use gauges. When saturation is high, the system is under stress even if latency hasn’t spiked yet.
From these you derive availability (e.g. % of successful requests), error rate, and latency percentiles — the usual inputs to SLOs and alerting. As an experienced engineer, I start with these four and add more only when we have a clear question (e.g. “which endpoint is slow?” → add labels for endpoint).
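The derivation of availability and error rate from raw counters is simple arithmetic; a minimal sketch (function names are illustrative, not from any client library):

```python
# Sketch: deriving availability and error rate from raw counters —
# the usual inputs to SLOs and alerting.

def availability(success_count, total_count):
    """Fraction of successful requests (e.g. non-5xx responses)."""
    return success_count / total_count if total_count else 1.0

def error_rate(error_count, total_count):
    """Fraction of failed requests -- the usual SLO input."""
    return error_count / total_count if total_count else 0.0

# 1,000,000 requests with 500 errors:
a = availability(999_500, 1_000_000)  # 0.9995
e = error_rate(500, 1_000_000)        # 0.0005
```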
Lesson 1 Takeaway
Start with latency, traffic, errors, saturation. Metric types (counter, gauge, histogram) map to these; the rest is collection, storage, and alerting.
Lesson 2: Metric Types
| Type | Meaning | Example |
|---|---|---|
| Counter | Monotonically increasing (total count) | http_requests_total, errors_total |
| Gauge | Current value (can go up or down) | active_connections, queue_depth, memory_usage |
| Histogram | Distribution (buckets + count + sum) | http_request_duration_seconds → P50, P99 |
| Summary | Client-side percentiles (no cross-instance aggregation) | Like a histogram, but quantiles are precomputed on the client |
- Counter: Use rate() or irate() to get "per second" or "per minute." Resets (e.g. restart) require care in rate calculation.
- Histogram: Define buckets (e.g. 0.01, 0.05, 0.1, 0.5, 1, 5); server stores count per bucket + total count + sum. Percentiles are computed from buckets. Labels (method, path, status) add dimensions; avoid high-cardinality labels (e.g. user_id) to prevent series explosion.
- Summary: Percentiles computed on the client over a sliding window and exposed at scrape; they cannot be aggregated across instances. Use when you need accurate per-instance percentiles without choosing buckets.
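How a percentile is estimated from cumulative bucket counts can be sketched in a few lines. This is a toy version in the spirit of Prometheus's histogram_quantile() (linear interpolation inside the bucket that crosses the quantile), not the real implementation:

```python
# Sketch: estimating a quantile from cumulative histogram buckets.
# buckets: sorted (upper_bound, cumulative_count) pairs, last bound = inf.

def histogram_quantile(q, buckets):
    total = buckets[-1][1]
    target = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= target:
            # The target rank falls in this bucket: interpolate linearly.
            if count == prev_count:
                return bound
            frac = (target - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * frac
        prev_bound, prev_count = bound, count
    return buckets[-1][0]

# 1000 requests: 400 under 50ms, 800 under 100ms, 980 under 500ms, all under 1s.
buckets = [(0.05, 400), (0.1, 800), (0.5, 980), (1.0, 1000), (float("inf"), 1000)]
p99 = histogram_quantile(0.99, buckets)  # interpolates inside the 0.5-1.0s bucket
```

Note the estimate depends on bucket boundaries: a P99 that lands in a wide bucket is only as precise as that bucket, which is why choosing buckets around your latency targets matters.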
Common Metrics Table
| Metric | Type | Description |
|---|---|---|
| http_requests_total | Counter | Total request count |
| http_request_duration_seconds | Histogram | Latency distribution |
| active_connections | Gauge | Current connection count |
| error_rate | Derived | errors / total (e.g. from counters) |
| queue_depth | Gauge | Current backlog size |
Lesson 2 Takeaway
Counters for totals (then rate); gauges for current state; histograms for latency and distributions. Labels = dimensions; keep cardinality low.
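The "counters, then rate" step, including the counter-reset care mentioned above, can be sketched as follows (a toy calculation, not how any real TSDB implements rate()):

```python
# Sketch: average per-second rate from counter samples, treating a
# drop in the counter value as a reset (e.g. process restart).

def counter_rate(samples):
    """samples: list of (timestamp_seconds, counter_value) pairs,
    ordered by time. Returns average increase per second."""
    if len(samples) < 2:
        return 0.0
    total_increase = 0.0
    for (t0, v0), (t1, v1) in zip(samples, samples[1:]):
        if v1 >= v0:
            total_increase += v1 - v0
        else:
            # Reset detected: assume the counter restarted from 0,
            # so the whole new value counts as increase.
            total_increase += v1
    elapsed = samples[-1][0] - samples[0][0]
    return total_increase / elapsed

# Three samples 30s apart with a restart before the last one:
# increases of 50, then a reset to 10 -> 60 total over 60s = 1.0/s.
r = counter_rate([(0, 100), (30, 150), (60, 10)])
```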
Lesson 3: Collection and Architecture
Metrics can be pulled (scraper hits your app) or pushed (app sends to a collector). Prometheus is pull-based; StatsD and many agents are push-based.
High-Level Flow (Pull Model — e.g. Prometheus)
- Apps expose /metrics (or similar) in a standard format (e.g. Prometheus text).
- Prometheus scrapes targets at an interval (e.g. 15s); stores in a time-series DB; evaluates alert rules; sends alerts to Alertmanager (dedupe, group, route to channels).
- Grafana (or similar) queries Prometheus and builds dashboards.
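What the app exposes at /metrics is plain text in the Prometheus exposition format. A minimal sketch of rendering it (the metric names and labels are illustrative; real apps use a client library such as prometheus_client instead of hand-rolling this):

```python
# Sketch: rendering one metric family in the Prometheus text
# exposition format, as served at /metrics for the scraper.

def render_metrics(name, metric_type, help_text, samples):
    """samples: list of (labels_dict, value) pairs."""
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} {metric_type}"]
    for labels, value in samples:
        label_str = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
        lines.append(f"{name}{{{label_str}}} {value}")
    return "\n".join(lines) + "\n"

print(render_metrics(
    "http_requests_total", "counter", "Total HTTP requests.",
    [({"method": "GET", "status": "200"}, 1027),
     ({"method": "GET", "status": "500"}, 3)]))
```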
Push Model
- Apps push to a gateway or agent (e.g. StatsD, Prometheus pushgateway for batch jobs). The collector then stores or forwards. Use push when you cannot expose a scrape endpoint (e.g. short-lived jobs).
Lesson 3 Takeaway
Pull (Prometheus scrape) is common for long-lived services; push for batch or ephemeral workloads. Metrics need dimensions (service, instance, endpoint, status) for filtering and aggregation.
Lesson 4: Storage and Cardinality
- Storage: Time-series DB (Prometheus TSDB, InfluxDB, VictoriaMetrics, etc.). Data is stored by series (metric name + labels); retention is typically 15 days to a few weeks; long-term storage (Thanos, Cortex) extends retention.
- Cardinality: Each unique combination of metric + labels is a series. High-cardinality labels (e.g. user_id, request_id) multiply the number of series and can explode storage and query cost. Prefer low-cardinality labels: service, instance, method, status. For high-cardinality data, sample or aggregate; use logs for raw events.
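Because each unique label combination is its own series, cardinality is multiplicative. A quick back-of-the-envelope sketch (the label counts below are illustrative):

```python
# Sketch: series count is the product of per-label value counts.
from math import prod

def series_count(label_values):
    """label_values: dict of label name -> number of distinct values."""
    return prod(label_values.values())

# A few low-cardinality labels: manageable.
ok = series_count({"service": 10, "instance": 50, "method": 5, "status": 6})
# 10 * 50 * 5 * 6 = 15,000 series.

# Add user_id with 100k distinct values and everything is multiplied by it:
bad = series_count({"service": 10, "instance": 50, "method": 5,
                    "status": 6, "user_id": 100_000})
# 1,500,000,000 series -- a series explosion.
```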
Lesson 4 Takeaway
Labels = dimensions; cardinality = product of label value counts. Avoid high-cardinality labels in metrics; use logs for detailed, high-cardinality data.
Lesson 5: Alerting
Alert Rule (Concept)
- Condition: e.g. rate(http_requests_total{status=~"5.."}[5m]) / rate(http_requests_total[5m]) > 0.01
- Duration: Require the condition to hold for e.g. 5 minutes (for: 5m) to avoid flapping.
- Annotations: Summary and description for runbooks and context.
- Severity: Label alerts (e.g. critical, warning) for routing and prioritization.
Example (Prometheus-style)
```yaml
groups:
  - name: api
    rules:
      - alert: HighErrorRate
        expr: rate(http_requests_total{status=~"5.."}[5m]) / rate(http_requests_total[5m]) > 0.01
        for: 5m
        labels: { severity: critical }
        annotations:
          summary: "Error rate above 1%"
```
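The for: 5m semantics can be illustrated with a toy evaluator (this is not Prometheus's actual implementation): the alert fires only once the condition has held on every consecutive evaluation covering the duration.

```python
# Sketch: "for:" duration semantics -- the condition must hold
# continuously for the whole window before the alert fires.

def alert_fires(condition_samples, for_seconds, interval_seconds):
    """condition_samples: one bool per evaluation interval, in order
    (True = condition met). Fires if the trailing run of True samples
    covers at least for_seconds."""
    needed = for_seconds // interval_seconds
    streak = 0
    for met in condition_samples:
        streak = streak + 1 if met else 0
    return streak >= needed

# Flapping, then the condition holds for 5 minutes at a 15s interval:
samples = [True, False] * 5 + [True] * 20
fired = alert_fires(samples, for_seconds=300, interval_seconds=15)  # True

# A brief 1-minute spike alone does not fire:
spike = alert_fires([True] * 4, for_seconds=300, interval_seconds=15)  # False
```

This debouncing is why a short transient blip never pages anyone, at the cost of detection latency equal to the for: duration.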
- Alertmanager: Deduplicates, groups, and routes alerts to channels (email, Slack, PagerDuty). Inhibition and grouping reduce noise.
Best Practices
- Avoid alert storms: Aggregate by service; use grouping and inhibition; tier alerts (P0/P1/P2).
- Actionable: Every alert should have a runbook or clear next step.
- Tune thresholds: Avoid rules that fire constantly or never; use error budget to set thresholds when possible.
Lesson 5 Takeaway
Alerts = condition + duration + annotations. Actionable alerts with runbooks; cardinality under control so alerting stays manageable.
Lesson 6: SLI, SLO, Error Budget
- SLI (Service Level Indicator): A measurable indicator of behavior (e.g. "percentage of requests with status 2xx" or "percentage of requests with latency < 200ms").
- SLO (Service Level Objective): The target for the SLI (e.g. "99.9% of requests succeed", "P99 latency < 200ms").
- Error budget: 100% − SLO (e.g. 99.9% → 0.1% "budget" for failures). Use it to decide when to pause releases, invest in reliability, or accept risk.
- SLA (Service Level Agreement): Contract with users (may include penalties). SLO is internal target; SLA is external commitment.
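The error budget arithmetic is worth making concrete. A minimal sketch for a 99.9% availability SLO over a 30-day window (numbers illustrative):

```python
# Sketch: error budget arithmetic for an availability SLO.

def error_budget(slo, window_days=30):
    """Returns (allowed_failure_fraction, allowed_downtime_minutes)."""
    budget_fraction = 1.0 - slo
    window_minutes = window_days * 24 * 60
    return budget_fraction, budget_fraction * window_minutes

def budget_consumed(errors, total_requests, slo):
    """Fraction of the error budget spent (1.0 = fully spent)."""
    budget_fraction = 1.0 - slo
    return (errors / total_requests) / budget_fraction

frac, minutes = error_budget(0.999)
# 99.9% -> 0.1% budget, ~43.2 minutes of full downtime per 30 days.

spent = budget_consumed(500, 1_000_000, 0.999)
# 0.05% error rate -> about half the budget consumed.
```

When budget_consumed approaches 1.0 before the window ends, that is the signal to slow releases and invest in reliability; well under 1.0, there is room to take risk.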
Lesson 6 Takeaway
SLI = what you measure; SLO = target; error budget = allowed failure; use it for release and reliability decisions.
Key Rules (Summary)
- Metrics need dimensions (service, instance, endpoint, status) for filtering and aggregation.
- Cardinality: Avoid high-cardinality labels; sample or aggregate when necessary.
- Alerts must be actionable; each alert should have a runbook.
- SLI/SLO: Define what you measure and your target; use error budget for release and reliability decisions.
What's Next
See Distributed Tracing and Java Profiling for related observability topics, and Slow Query and GC for performance monitoring.