When a service breaks at night, the on-call engineer is woken by an alert. But which one? If an alert fires on every single error, the team quickly starts ignoring them. If an alert fires only once the service has already been unavailable for several hours, it's already too late.

Here we'll work out how to build an alerting system around SLOs: what an error budget is, why different time windows matter and how to implement all of this in Python with FastAPI and Prometheus.

What an SLO is and why you need it

Teams used to monitor a service on the principle of "everything works / everything is broken". That approach doesn't answer the main question: how well is the service working for users right now?

An SLO (Service Level Objective) is a quantitative promise about the level of service. Not "we try", but "99.9% of requests are successful over a rolling 30-day window".

Why 99.9% and not 100%? Because 100% is unattainable: deployments, network failures and hardware faults all cause some errors. More importantly, a 100% target leaves no instrument for deciding whether a risky change can ship.

The error budget is the share of errors the SLO allows. With a target of 99.9% over a 30-day window the error budget is 0.1%, or roughly 43 minutes of full unavailability per month (0.001 × 30 × 24 × 60 ≈ 43.2). If the budget is nearly exhausted, the team switches from developing new features to stabilization. If the budget is fine, you can take risks.
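
The arithmetic behind the budget can be sketched in a few lines. The helper names error_budget_minutes and allowed_failures are hypothetical and exist only for this illustration; the first reads the SLO as time, the second as a number of requests.

```python
def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Minutes of full unavailability the SLO tolerates within the window."""
    window_minutes = window_days * 24 * 60
    return (1 - slo_target) * window_minutes


def allowed_failures(slo_target: float, total_requests: int) -> int:
    """Failed requests the SLO tolerates out of total_requests."""
    # round() rather than int() so float noise cannot turn 500 into 499
    return round((1 - slo_target) * total_requests)


if __name__ == "__main__":
    print(f"{error_budget_minutes(0.999):.1f} min")   # 99.9% over 30 days
    print(f"{error_budget_minutes(0.9995):.1f} min")  # 99.95% over 30 days
    print(allowed_failures(0.999, 1_000_000), "failed requests per million")
```

For a service that handles a million requests a month, a 99.9% target therefore leaves room for about a thousand failed requests.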

How to wire metrics into FastAPI

The source of SLI metrics in Python/FastAPI is the prometheus-fastapi-instrumentator library. It automatically registers the request counter http_requests_total and the histogram http_request_duration_seconds with the labels handler, method, status_code.

from fastapi import FastAPI
from prometheus_fastapi_instrumentator import Instrumentator

app = FastAPI()

Instrumentator(
    should_group_status_codes=False,
    should_ignore_untemplated=True,
    excluded_handlers=["/health/live", "/health/ready", "/metrics"],
).instrument(app).expose(app, endpoint="/metrics", include_in_schema=False)

The should_group_status_codes=False parameter matters: without it all 5xx fall into one label and you won't be able to tell 500 from 503.

For business metrics, add counters and histograms through prometheus_client directly:

from prometheus_client import Counter, Histogram

ORDER_CREATED = Counter(
    "order_created_total",
    "Successful order creations",
    ["payment_method"],
)

ORDER_FAILED = Counter(
    "order_failed_total",
    "Failed order creations",
    ["reason"],
)

CHECKOUT_DURATION = Histogram(
    "checkout_duration_seconds",
    "End-to-end checkout latency",
    buckets=[0.1, 0.25, 0.5, 1.0, 2.5, 5.0],
)

Naming convention: snake_case, the unit of measurement in the suffix (_seconds, _total, _bytes).

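An important point: the request-path label is called handler (its value is the path template, for example /orders/{order_id}), not uri as in Micrometer/Spring. Label names can also differ between library versions; for example, the status label may be exposed as status rather than status_code. Before writing PromQL queries, check the real label names through /metrics and adjust the queries to match.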
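
One way to do that check without reading the output by eye is to extract the label names from the exposition text. The sketch below is an illustration under assumptions: SAMPLE_METRICS is made-up text, not real output of any library version, and label_names is a hypothetical helper. The sample deliberately uses a label set that differs from the one assumed in the queries of this article, to show how the check catches a mismatch.

```python
import re

# Made-up sample of what GET /metrics might return (illustrative only).
SAMPLE_METRICS = '''\
# HELP http_requests_total Total number of requests.
# TYPE http_requests_total counter
http_requests_total{handler="/orders",method="POST",status="201"} 42.0
http_requests_total{handler="/orders/{order_id}",method="GET",status="200"} 7.0
'''

# A label name is an identifier immediately followed by ="
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="')


def label_names(metrics_text: str, metric: str) -> set[str]:
    """Collect the label names used by every sample of the given metric."""
    names: set[str] = set()
    for line in metrics_text.splitlines():
        if line.startswith(metric + "{"):
            # Take everything between the first "{" and the last "}";
            # label values such as /orders/{order_id} may contain braces.
            label_part = line[len(metric) + 1 : line.rindex("}")]
            names.update(LABEL_RE.findall(label_part))
    return names


if __name__ == "__main__":
    found = label_names(SAMPLE_METRICS, "http_requests_total")
    print(sorted(found))
    print("status_code present:", "status_code" in found)
```

In practice you would feed the function the body of a real request to /metrics instead of the sample string.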

SLOs for specific endpoints

Not all requests are equally important. Usually you single out the critical endpoints and for each one define two SLOs: for availability and for latency.

An example set of SLOs for an order service:

Endpoint                Availability     Latency
POST /orders            99.9% non-5xx    p95 < 500ms
POST /payments          99.95% non-5xx   p95 < 1s
GET /orders/{id}        99.95% non-5xx   p95 < 200ms
GET /products/search    99.5% non-5xx    p95 < 800ms

An SLI (Service Level Indicator) is the current measured value. In PromQL for POST /orders:

# Availability — the share of non-5xx requests
sum(rate(http_requests_total{handler="/orders",method="POST",status_code!~"5.."}[30d]))
  /
sum(rate(http_requests_total{handler="/orders",method="POST"}[30d]))

# Latency — the 95th percentile
histogram_quantile(0.95,
  sum by (le) (rate(http_request_duration_seconds_bucket{handler="/orders",method="POST"}[30d]))
)

For latency, always use percentiles (histogram_quantile), not the average. The average hides the tails: 90% of requests may be fast and 10% slow — the average will show "all good".
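
A small numeric illustration of that effect, using made-up latencies (90 requests at 100 ms and 10 requests at 3 s). The percentile helper uses the simple nearest-rank method, which is an assumption of this sketch and not the interpolation that histogram_quantile performs over buckets.

```python
import math
import statistics


def percentile(values: list[float], q: float) -> float:
    """Nearest-rank percentile: the smallest value with at least q of the samples at or below it."""
    ordered = sorted(values)
    rank = max(math.ceil(q * len(ordered)), 1)
    return ordered[rank - 1]


# Made-up sample: 90% of requests are fast, 10% are very slow.
latencies = [0.1] * 90 + [3.0] * 10

mean = statistics.fmean(latencies)
p50 = percentile(latencies, 0.50)
p95 = percentile(latencies, 0.95)

if __name__ == "__main__":
    print(f"mean = {mean:.2f}s")  # looks fine against a 500 ms target
    print(f"p50  = {p50:.2f}s")
    print(f"p95  = {p95:.2f}s")   # reveals the slow tail
```

The mean stays under a 500 ms target even though one request in ten takes three seconds; the 95th percentile shows the problem immediately.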

Multi-window burn rate: a fast and a slow signal

The classic mistake is a single alert of the form "error rate > 1% over the last 5 minutes". It doesn't distinguish a real incident from short-lived noise and doesn't tell you how fast the budget is being consumed.

The approach from the Google SRE Workbook is multi-window multi-burn-rate. The idea is simple: we look at the rate at which the budget is being burned over different windows.

Burn rate is how many times faster than normal the error budget is being burned:

burn_rate = (error_rate_in_the_window) / (1 - SLO_target)

For an SLO of 99.9% (budget = 0.001):

  • a burn rate of 14.4 over a 1h window means that 2% of the monthly budget burns in one hour (14.4 × 1h / 720h): that's an incident;
  • a burn rate of 6 over a 6h window means that 5% of the budget burns in six hours: slow degradation that you need to look into, but not urgently;
  • a burn rate ≤ 1 is the normal rate, nothing scary: the budget lasts the whole 30 days.
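
Working the arithmetic through in code makes the thresholds easier to check. This is a minimal sketch; the function names are hypothetical and the 30-day period (720 hours) is the SLO window used throughout this article.

```python
SLO_PERIOD_HOURS = 30 * 24  # rolling 30-day window


def burn_rate(error_rate: float, slo_target: float) -> float:
    """How many times faster than 'exactly on budget' errors are occurring."""
    return error_rate / (1 - slo_target)


def budget_consumed(rate: float, window_hours: float) -> float:
    """Share of the whole budget spent if the burn rate holds for the window."""
    return rate * window_hours / SLO_PERIOD_HOURS


def hours_to_exhaustion(rate: float) -> float:
    """Time until a full budget is gone at a constant burn rate."""
    return SLO_PERIOD_HOURS / rate


if __name__ == "__main__":
    for rate, window in ((14.4, 1), (6, 6), (1, 720)):
        print(
            f"burn rate {rate}: {budget_consumed(rate, window):.0%} of the budget "
            f"in {window}h, exhausted after {hours_to_exhaustion(rate):.0f}h"
        )
```

At a burn rate of 14.4 the whole monthly budget is gone in about 50 hours, at 6 in five days, and at 1 it lasts exactly the 30 days.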

The corresponding Prometheus alerting rules for POST /orders:

groups:
  - name: orders.slo
    rules:
      - alert: OrdersSloFastBurn
        expr: |
          (
            sum(rate(http_requests_total{handler="/orders",method="POST",status_code=~"5.."}[1h]))
            /
            sum(rate(http_requests_total{handler="/orders",method="POST"}[1h]))
          ) > (14.4 * (1 - 0.999))
        for: 2m
        labels:
          severity: critical
          team: order-service
        annotations:
          summary: "Orders SLO fast burn — potential outage"
          runbook: https://runbooks.internal/orders-slo-fast-burn

      - alert: OrdersSloSlowBurn
        expr: |
          (
            sum(rate(http_requests_total{handler="/orders",method="POST",status_code=~"5.."}[6h]))
            /
            sum(rate(http_requests_total{handler="/orders",method="POST"}[6h]))
          ) > (6 * (1 - 0.999))
        for: 15m
        labels:
          severity: warning
          team: order-service
        annotations:
          summary: "Orders SLO slow burn — degradation without an outage"
          runbook: https://runbooks.internal/orders-slo-slow-burn

Fast burn wakes the on-call engineer immediately. Slow burn creates a task for the next working day. Always set the for field (Prometheus itself does not require it): without it a single spike will raise the alarm.
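
When several endpoints have different targets, writing each expression by hand gets error-prone. One option is to generate the expression text; the sketch below assumes the same metric and label names as the rules above, and burn_rate_expr is a hypothetical helper, not part of any library.

```python
def burn_rate_expr(
    handler: str, method: str, slo_target: float, window: str, burn_rate: float
) -> str:
    """Build a PromQL condition: error ratio over the window above burn_rate x budget."""
    selector = f'handler="{handler}",method="{method}"'
    # Label names follow the rules in this article; adjust them to your /metrics output.
    errors = f'sum(rate(http_requests_total{{{selector},status_code=~"5.."}}[{window}]))'
    total = f'sum(rate(http_requests_total{{{selector}}}[{window}]))'
    threshold = burn_rate * (1 - slo_target)
    return f"({errors} / {total}) > {threshold:.6g}"


if __name__ == "__main__":
    # Fast and slow burn for two endpoints with different targets.
    for handler, method, slo in (("/orders", "POST", 0.999), ("/payments", "POST", 0.9995)):
        print(burn_rate_expr(handler, method, slo, "1h", 14.4))
        print(burn_rate_expr(handler, method, slo, "6h", 6))
```

The threshold scales with the target: for a 99.95% SLO the fast-burn condition trips at an error ratio of 0.0072 instead of 0.0144.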

Error budget exhaustion: a signal to the team

Separately from burn rate, you should set up an alert on the exhaustion of the budget itself over a rolling 30 days. This isn't "fix it urgently at night", it's a signal: for the coming weeks the team focuses on stability.

# How much budget is left (1 = all free, 0 = exhausted)
1 - (
  (1 - sum(rate(http_requests_total{handler="/orders",method="POST",status_code!~"5.."}[30d]))
        /
       sum(rate(http_requests_total{handler="/orders",method="POST"}[30d])))
  /
  (1 - 0.999)
)

The alert rule built on that expression:

- alert: OrdersErrorBudgetExhausted
  expr: <budget_remaining_expression> < 0.1
  for: 1h
  labels:
    severity: warning
  annotations:
    summary: "Only 10% of the error budget remains — /orders"
    description: |
      The team switches from new features to reliability.
      Risky rollouts are paused until the budget recovers.
    runbook: https://runbooks.internal/orders-error-budget
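
The budget-remaining expression above is easier to sanity-check with concrete numbers. A minimal sketch of the same formula in Python; budget_remaining is a hypothetical name, and returning 1.0 for zero traffic is a choice made for this illustration (the PromQL expression would yield no usable value in that case).

```python
def budget_remaining(good: int, total: int, slo_target: float) -> float:
    """Share of the error budget still unspent: 1 = untouched, 0 = exhausted, < 0 = overspent."""
    if total == 0:
        return 1.0  # no traffic, nothing spent (assumption of this sketch)
    error_rate = 1 - good / total
    return 1 - error_rate / (1 - slo_target)


if __name__ == "__main__":
    total = 1_000_000
    for failed in (0, 500, 900, 1_000, 2_000):
        remaining = budget_remaining(total - failed, total, 0.999)
        print(f"{failed:>5} failed of {total}: {remaining:+.0%} of the budget left")
```

With a 99.9% target and a million requests, 900 failures leave 10% of the budget, which is the point where the alert above starts firing; 2,000 failures put the budget at minus 100%.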

Alerts beyond the SLO

An SLO measures success for users. But a service can degrade for reasons the SLO will only catch when it's too late. So alongside the SLO alerts you set up separate categories:

Infrastructure — when the service is overloaded but still coping:

  • process_resident_memory_bytes > 1.5e9 (about 1.5 GB): a possible memory leak
  • event-loop lag > 100ms: a blocking call inside async code
  • uvicorn_workers_busy / uvicorn_workers_total > 0.9: the workers are running out (the gauge names are illustrative; you have to export such metrics yourself)
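
Event-loop lag is not exported out of the box, so it has to be measured. One common approach, sketched here with the standard library only, is to sleep for a fixed interval and see how late the wake-up is. The names measure_lag, blocking_handler and demo are hypothetical; in a real service the measured value would be written to a gauge in a background task.

```python
import asyncio
import time


async def measure_lag(interval: float = 0.05) -> float:
    """Sleep for `interval` and return how much later than planned the loop woke us up."""
    loop = asyncio.get_running_loop()
    start = loop.time()
    await asyncio.sleep(interval)
    return max(0.0, loop.time() - start - interval)


async def blocking_handler(seconds: float) -> None:
    """Simulates a bug: a synchronous sleep inside async code blocks the whole loop."""
    time.sleep(seconds)


async def demo() -> float:
    probe = asyncio.create_task(measure_lag(0.05))
    await asyncio.sleep(0)        # let the probe start its timer
    await blocking_handler(0.2)   # the loop cannot wake the probe during this call
    return await probe


if __name__ == "__main__":
    lag = asyncio.run(demo())
    print(f"event-loop lag: {lag * 1000:.0f} ms")
```

The probe asked for 50 ms and was held up by the 200 ms blocking call, so the reported lag is well above the 100 ms threshold from the list.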

Domain — the business logic behaves unexpectedly:

PRODUCT_CHECKOUT_BLOCKED = Counter(
    "product_checkout_blocked_total",
    "Checkout blocked due to inventory hold failure",
    ["reason"],
)

The alert on this counter:

- alert: ProductCheckoutBlockedHigh
  # more than 5 blocked checkouts per second, sustained for 5 minutes
  expr: sum(rate(product_checkout_blocked_total[5m])) > 5
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "High rate of blocked checkouts: check the inventory"
    runbook: https://runbooks.internal/product-checkout-blocked

Resilience — the state of the circuit breaker. If it's open, requests go to a fallback path — the SLO is still fine, but the main dependency is unavailable, a problem is brewing.

Kafka consumer lag — if the service consumes events, queue lag directly affects users.

Common mistakes

An alert on every error in the logs. This is the main road to alert fatigue: the team starts ignoring them. Aggregate by type and use burn-rate alerts.

An SLO of 100%. There's nothing to operate with: any error formally breaks the target. Use 99.9% (43 minutes a month) or 99.95% for critical services.

An alert without a runbook. The on-call engineer gets a notification and doesn't know what to do. Every alert should contain a link to instructions: what to check, whom to call, how to roll back.

Only a single time window. A burn rate over 30 days alone reacts too slowly to a real incident. You need a short window (1h) for a fast signal and a long one (6h) for slow degradation.

Latency by the average. An average (the rate of http_request_duration_seconds_sum divided by the rate of http_request_duration_seconds_count) hides slow requests. Only percentiles show the real user experience.

In short

  • SLO is a quantitative target (for example, 99.9% non-5xx over a 30-day window). Error budget is the allowed number of errors, a prioritization instrument.
  • In Python/FastAPI metrics are wired in through prometheus-fastapi-instrumentator; business metrics — through prometheus_client directly.
  • The request label is handler (not uri as in Spring). Metric names — snake_case with a unit suffix (_total, _seconds).
  • Multi-window burn rate: fast burn (1h, rate > 14.4) — a critical alert; slow burn (6h, rate > 6) — a warning. The for field is mandatory.
  • Latency is measured with percentiles (histogram_quantile(0.95, ...)), not the average.
  • SLO alerts aren't the only ones. Alongside them you need alerts for infrastructure, domain, resilience and queue lag.
  • Every alert contains a link to a runbook. Without instructions an alert is useless.