Metrics & Monitoring System
An Experienced Engineer’s Walkthrough for Backend Engineers
Metrics and monitoring give you observability: you need to know if the system is healthy, where it is slow, and when to act. When you’re new, it’s easy to add a few counters or logs and stop there; in production you need a clear model of what to measure (e.g. latency, traffic, errors, saturation), how to store and query it (e.g. time-series DB), and how to turn that into alerts and SLOs so the team knows when to react. The usual building blocks are metrics (counters, gauges, histograms), logs, and traces. This article focuses on metrics: types (counter, gauge, histogram), how they are collected (pull vs push) and stored, how alerting works, and how SLI/SLO and error budget tie to reliability. Stacks like Prometheus + Grafana are common; the concepts apply broadly. No prior experience with Prometheus or SLOs is assumed.
Lesson 1: What Are We Measuring?
Before choosing tools, we need to know what to measure. As a newcomer, it’s easy to add a few random counters (“total requests,” “cache hits”) and then struggle to answer “is the system healthy?” or “why was it slow last night?” The golden signals give a simple framework that works for most services:
- Latency: How long do requests take? You care about percentiles (P50, P99, P999), not just average, because a few slow requests can dominate user experience. Use histograms or summaries.
- Traffic: How many requests per second (or per minute)? Use counters, then rate() (or equivalent) to get "requests per second." This tells you load and trends.
- Errors: How many 5xx responses, timeouts, or business errors? Use counters by status code or error type. Error rate (errors / total requests) is what you'll often put in an SLO.
- Saturation: How full are resources? Queue depth, CPU usage, memory usage. Use gauges. When saturation is high, the system is under stress even if latency hasn’t spiked yet.
From these you derive availability (e.g. % of successful requests), error rate, and latency percentiles — the usual inputs to SLOs and alerting. As an experienced engineer, I start with these four and add more only when we have a clear question (e.g. “which endpoint is slow?” → add labels for endpoint).
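The derivation of availability and error rate from raw counters is simple arithmetic; a minimal sketch (function names are illustrative, not from any client library):

```python
# Sketch: deriving availability and error rate from raw counters —
# the usual inputs to SLOs and alerting.

def availability(success_count, total_count):
    """Fraction of successful requests (e.g. non-5xx responses)."""
    return success_count / total_count if total_count else 1.0

def error_rate(error_count, total_count):
    """Fraction of failed requests -- the usual SLO input."""
    return error_count / total_count if total_count else 0.0

# 1,000,000 requests with 500 errors:
a = availability(999_500, 1_000_000)  # 0.9995
e = error_rate(500, 1_000_000)        # 0.0005
```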
Lesson 1 Takeaway
Start with latency, traffic, errors, saturation. Metric types (counter, gauge, histogram) map to these; the rest is collection, storage, and alerting.
Lesson 2: Metric Types
| Type | Meaning | Example |
|---|---|---|
| Counter | Monotonically increasing (total count) | http_requests_total, errors_total |
| Gauge | Current value (can go up or down) | active_connections, queue_depth, memory_usage |
| Histogram | Distribution (buckets + count + sum) | http_request_duration_seconds → P50, P99 |
| Summary | Client-side percentiles (no cross-instance aggregation) | Like a histogram, but quantiles are precomputed on the client |
- Counter: Use rate() or irate() to get "per second" or "per minute." Resets (e.g. restart) require care in rate calculation.
- Histogram: Define buckets (e.g. 0.01, 0.05, 0.1, 0.5, 1, 5); server stores count per bucket + total count + sum. Percentiles are computed from buckets. Labels (method, path, status) add dimensions; avoid high-cardinality labels (e.g. user_id) to prevent series explosion.
- Summary: Percentiles computed on the client over a sliding window and exposed at scrape; they cannot be aggregated across instances. Use when you need accurate per-instance percentiles without choosing buckets.
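How a percentile is estimated from cumulative bucket counts can be sketched in a few lines. This is a toy version in the spirit of Prometheus's histogram_quantile() (linear interpolation inside the bucket that crosses the quantile), not the real implementation:

```python
# Sketch: estimating a quantile from cumulative histogram buckets.
# buckets: sorted (upper_bound, cumulative_count) pairs, last bound = inf.

def histogram_quantile(q, buckets):
    total = buckets[-1][1]
    target = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= target:
            # The target rank falls in this bucket: interpolate linearly.
            if count == prev_count:
                return bound
            frac = (target - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * frac
        prev_bound, prev_count = bound, count
    return buckets[-1][0]

# 1000 requests: 400 under 50ms, 800 under 100ms, 980 under 500ms, all under 1s.
buckets = [(0.05, 400), (0.1, 800), (0.5, 980), (1.0, 1000), (float("inf"), 1000)]
p99 = histogram_quantile(0.99, buckets)  # interpolates inside the 0.5-1.0s bucket
```

Note the estimate depends on bucket boundaries: a P99 that lands in a wide bucket is only as precise as that bucket, which is why choosing buckets around your latency targets matters.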
Common Metrics Table
| Metric | Type | Description |
|---|---|---|
| http_requests_total | Counter | Total request count |
| http_request_duration_seconds | Histogram | Latency distribution |
| active_connections | Gauge | Current connection count |
| error_rate | Derived | errors / total (e.g. from counters) |
| queue_depth | Gauge | Current backlog size |
Lesson 2 Takeaway
Counters for totals (then rate); gauges for current state; histograms for latency and distributions. Labels = dimensions; keep cardinality low.
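The "counters, then rate" step, including the counter-reset care mentioned above, can be sketched as follows (a toy calculation, not how any real TSDB implements rate()):

```python
# Sketch: average per-second rate from counter samples, treating a
# drop in the counter value as a reset (e.g. process restart).

def counter_rate(samples):
    """samples: list of (timestamp_seconds, counter_value) pairs,
    ordered by time. Returns average increase per second."""
    if len(samples) < 2:
        return 0.0
    total_increase = 0.0
    for (t0, v0), (t1, v1) in zip(samples, samples[1:]):
        if v1 >= v0:
            total_increase += v1 - v0
        else:
            # Reset detected: assume the counter restarted from 0,
            # so the whole new value counts as increase.
            total_increase += v1
    elapsed = samples[-1][0] - samples[0][0]
    return total_increase / elapsed

# Three samples 30s apart with a restart before the last one:
# increases of 50, then a reset to 10 -> 60 total over 60s = 1.0/s.
r = counter_rate([(0, 100), (30, 150), (60, 10)])
```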
Lesson 3: Collection and Architecture
Metrics can be pulled (scraper hits your app) or pushed (app sends to a collector). Prometheus is pull-based; StatsD and many agents are push-based.
High-Level Flow (Pull Model — e.g. Prometheus)
- Apps expose /metrics (or similar) in a standard format (e.g. Prometheus text).
- Prometheus scrapes targets at an interval (e.g. 15s); stores in a time-series DB; evaluates alert rules; sends alerts to Alertmanager (dedupe, group, route to channels).
- Grafana (or similar) queries Prometheus and builds dashboards.
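What the app exposes at /metrics is plain text in the Prometheus exposition format. A minimal sketch of rendering it (the metric names and labels are illustrative; real apps use a client library such as prometheus_client instead of hand-rolling this):

```python
# Sketch: rendering one metric family in the Prometheus text
# exposition format, as served at /metrics for the scraper.

def render_metrics(name, metric_type, help_text, samples):
    """samples: list of (labels_dict, value) pairs."""
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} {metric_type}"]
    for labels, value in samples:
        label_str = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
        lines.append(f"{name}{{{label_str}}} {value}")
    return "\n".join(lines) + "\n"

print(render_metrics(
    "http_requests_total", "counter", "Total HTTP requests.",
    [({"method": "GET", "status": "200"}, 1027),
     ({"method": "GET", "status": "500"}, 3)]))
```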
Push Model
- Apps push to a gateway or agent (e.g. StatsD, Prometheus pushgateway for batch jobs). The collector then stores or forwards. Use push when you cannot expose a scrape endpoint (e.g. short-lived jobs).
Lesson 3 Takeaway
Pull (Prometheus scrape) is common for long-lived services; push for batch or ephemeral workloads. Metrics need dimensions (service, instance, endpoint, status) for filtering and aggregation.
Lesson 4: Storage and Cardinality
- Storage: Time-series DB (Prometheus TSDB, InfluxDB, VictoriaMetrics, etc.). Data is stored by series (metric name + labels); retention is typically 15 days to a few weeks; long-term storage (Thanos, Cortex) extends retention.
- Cardinality: Each unique combination of metric + labels is a series. High-cardinality labels (e.g. user_id, request_id) multiply the number of series and can explode storage and query cost. Prefer low-cardinality labels: service, instance, method, status. For high-cardinality data, sample or aggregate; use logs for raw events.
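Because each unique label combination is its own series, cardinality is multiplicative. A quick back-of-the-envelope sketch (the label counts below are illustrative):

```python
# Sketch: series count is the product of per-label value counts.
from math import prod

def series_count(label_values):
    """label_values: dict of label name -> number of distinct values."""
    return prod(label_values.values())

# A few low-cardinality labels: manageable.
ok = series_count({"service": 10, "instance": 50, "method": 5, "status": 6})
# 10 * 50 * 5 * 6 = 15,000 series.

# Add user_id with 100k distinct values and everything is multiplied by it:
bad = series_count({"service": 10, "instance": 50, "method": 5,
                    "status": 6, "user_id": 100_000})
# 1,500,000,000 series -- a series explosion.
```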
Lesson 4 Takeaway
Labels = dimensions; cardinality = product of label value counts. Avoid high-cardinality labels in metrics; use logs for detailed, high-cardinality data.
Lesson 5: Alerting
Alert Rule (Concept)
- Condition: e.g. rate(http_requests_total{status=~"5.."}[5m]) / rate(http_requests_total[5m]) > 0.01
- Duration: Require the condition to hold for e.g. 5 minutes (for: 5m) to avoid flapping.
- Annotations: Summary and description for runbooks and context.
- Severity: Label alerts (e.g. critical, warning) for routing and prioritization.
Example (Prometheus-style)
```yaml
groups:
  - name: api
    rules:
      - alert: HighErrorRate
        expr: rate(http_requests_total{status=~"5.."}[5m]) / rate(http_requests_total[5m]) > 0.01
        for: 5m
        labels: { severity: critical }
        annotations:
          summary: "Error rate above 1%"
```
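The for: 5m semantics can be illustrated with a toy evaluator (this is not Prometheus's actual implementation): the alert fires only once the condition has held on every consecutive evaluation covering the duration.

```python
# Sketch: "for:" duration semantics -- the condition must hold
# continuously for the whole window before the alert fires.

def alert_fires(condition_samples, for_seconds, interval_seconds):
    """condition_samples: one bool per evaluation interval, in order
    (True = condition met). Fires if the trailing run of True samples
    covers at least for_seconds."""
    needed = for_seconds // interval_seconds
    streak = 0
    for met in condition_samples:
        streak = streak + 1 if met else 0
    return streak >= needed

# Flapping, then the condition holds for 5 minutes at a 15s interval:
samples = [True, False] * 5 + [True] * 20
fired = alert_fires(samples, for_seconds=300, interval_seconds=15)  # True

# A brief 1-minute spike alone does not fire:
spike = alert_fires([True] * 4, for_seconds=300, interval_seconds=15)  # False
```

This debouncing is why a short transient blip never pages anyone, at the cost of detection latency equal to the for: duration.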
- Alertmanager: Deduplicates, groups, and routes alerts to channels (email, Slack, PagerDuty). Inhibition and grouping reduce noise.
Best Practices
- Avoid alert storms: Aggregate by service; use grouping and inhibition; tier alerts (P0/P1/P2).
- Actionable: Every alert should have a runbook or clear next step.
- Tune thresholds: Avoid rules that fire constantly or never; use error budget to set thresholds when possible.
Lesson 5 Takeaway
Alerts = condition + duration + annotations. Actionable alerts with runbooks; cardinality under control so alerting stays manageable.
Lesson 6: SLI, SLO, Error Budget
- SLI (Service Level Indicator): A measurable indicator of behavior (e.g. "percentage of requests with status 2xx" or "percentage of requests with latency < 200ms").
- SLO (Service Level Objective): The target for the SLI (e.g. "99.9% of requests succeed", "P99 latency < 200ms").
- Error budget: 100% − SLO (e.g. 99.9% → 0.1% "budget" for failures). Use it to decide when to pause releases, invest in reliability, or accept risk.
- SLA (Service Level Agreement): Contract with users (may include penalties). SLO is internal target; SLA is external commitment.
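The error budget arithmetic is worth making concrete. A minimal sketch for a 99.9% availability SLO over a 30-day window (numbers illustrative):

```python
# Sketch: error budget arithmetic for an availability SLO.

def error_budget(slo, window_days=30):
    """Returns (allowed_failure_fraction, allowed_downtime_minutes)."""
    budget_fraction = 1.0 - slo
    window_minutes = window_days * 24 * 60
    return budget_fraction, budget_fraction * window_minutes

def budget_consumed(errors, total_requests, slo):
    """Fraction of the error budget spent (1.0 = fully spent)."""
    budget_fraction = 1.0 - slo
    return (errors / total_requests) / budget_fraction

frac, minutes = error_budget(0.999)
# 99.9% -> 0.1% budget, ~43.2 minutes of full downtime per 30 days.

spent = budget_consumed(500, 1_000_000, 0.999)
# 0.05% error rate -> about half the budget consumed.
```

When budget_consumed approaches 1.0 before the window ends, that is the signal to slow releases and invest in reliability; well under 1.0, there is room to take risk.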
Lesson 6 Takeaway
SLI = what you measure; SLO = target; error budget = allowed failure; use it for release and reliability decisions.
Key Rules (Summary)
- Metrics need dimensions (service, instance, endpoint, status) for filtering and aggregation.
- Cardinality: Avoid high-cardinality labels; sample or aggregate when necessary.
- Alerts must be actionable; each alert should have a runbook.
- SLI/SLO: Define what you measure and your target; use error budget for release and reliability decisions.
What's Next
See Distributed Tracing and Java Profiling for related observability topics, and Slow Query and GC for performance monitoring.