When Kubernetes decides to restart or update a pod, it sends the process a SIGTERM signal and waits for it to finish on its own. FastAPI with uvicorn can do this correctly — wait for in-flight requests and close connections. But without correct configuration of Kubernetes itself, some requests still fail: kube-proxy keeps sending traffic to the dying pod for several more seconds after shutdown begins.

This article covers which settings you need and why.

How Kubernetes stops a pod

When Kubernetes terminates a pod (during a deployment update or scale-down), it does so in several steps:

  1. Runs the preStop hook — if one is configured.
  2. In parallel, removes the pod from the list of active service addresses (endpoints).
  3. After preStop finishes, sends the process SIGTERM.
  4. Waits for the process to finish on its own.
  5. If the process hasn't finished within terminationGracePeriodSeconds — forcibly kills it with SIGKILL.

The problem without preStop: steps 2 and 3 happen almost simultaneously, but kube-proxy updates its routing rules asynchronously, and that can take several seconds (5–15 seconds is a typical range; the actual delay depends on the cluster). During that window new requests still go to a pod that has already begun shutting down and may not process them.

preStop: sleep 10 solves this problem: the pod "sleeps" for 10 seconds before receiving SIGTERM. During this time kube-proxy manages to update the routes, and new requests no longer reach the dying pod.
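
The effect of the sleep can be put into numbers with a small sketch. The function name exposure_window and the delays passed to it are illustrative assumptions, not measured values; the model only says that requests can still be routed to the pod for whatever part of the propagation delay the preStop sleep does not cover.

```python
def exposure_window(propagation_delay_s: float, prestop_sleep_s: float = 0.0) -> float:
    """Seconds during which traffic can still reach a pod that already got SIGTERM.

    propagation_delay_s: assumed time for routing rules to drop the pod.
    prestop_sleep_s: how long the preStop hook delays SIGTERM.
    """
    if propagation_delay_s < 0 or prestop_sleep_s < 0:
        raise ValueError("durations must be non-negative")
    return max(0.0, propagation_delay_s - prestop_sleep_s)


# Without preStop, the whole propagation delay is exposed.
print(exposure_window(8.0))          # 8.0
# With a 10-second sleep, an 8-second delay is fully covered.
print(exposure_window(8.0, 10.0))    # 0.0
# A slower update (15 s) still leaves a 5-second gap.
print(exposure_window(15.0, 10.0))   # 5.0
```

If routing updates in a given cluster are known to be slower than the sleep, the sleep (and the grace period with it) has to grow accordingly.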

terminationGracePeriodSeconds: why 60, not 30

The default Kubernetes value is 30 seconds. That's not enough.

Let's look at how the time is spent:

T=0s    Kubernetes begins terminating the pod
T=0s    preStop sleep 10 starts
T=10s   preStop finished, the pod receives SIGTERM
T=10s+  FastAPI/uvicorn begins graceful shutdown:
          - stops accepting new connections (readiness probes start failing)
          - waits for in-flight requests to finish (up to 25 seconds)
T=35s   At the latest, the lifespan shutdown runs: the ready flag is cleared, connections to the DB, Kafka, etc. are closed
T=60s   If the process is still alive — SIGKILL

With terminationGracePeriodSeconds: 30, uvicorn is left with only 20 seconds after preStop. If requests take longer or resource closing is slow, the pod will get SIGKILL in the middle of the process.

The minimum budget: 10 seconds preStop + 25 seconds uvicorn graceful + time to close resources ≈ 40–50 seconds. We set 60 seconds with headroom.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: order-service
spec:
  template:
    spec:
      terminationGracePeriodSeconds: 60
      containers:
        - name: app
          image: order-service:1.4.2
          lifecycle:
            preStop:
              exec:
                command: ["sh", "-c", "sleep 10"]
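
The budget arithmetic behind these numbers can be checked in a few lines. This is a sketch: the class ShutdownBudget is a hypothetical helper, and the 10 seconds assumed for closing resources is an illustrative value inside the 40–50 second estimate above, not a measurement.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ShutdownBudget:
    prestop_sleep_s: int       # preStop: sleep N
    graceful_timeout_s: int    # uvicorn --timeout-graceful-shutdown
    resource_close_s: int      # assumed time to close DB/Kafka connections

    @property
    def required_s(self) -> int:
        """Worst-case time from the start of termination to process exit."""
        return self.prestop_sleep_s + self.graceful_timeout_s + self.resource_close_s

    def headroom_s(self, termination_grace_period_s: int) -> int:
        """Seconds left before SIGKILL; negative means the budget does not fit."""
        return termination_grace_period_s - self.required_s


budget = ShutdownBudget(prestop_sleep_s=10, graceful_timeout_s=25, resource_close_s=10)
print(budget.required_s)        # 45
print(budget.headroom_s(60))    # 15
print(budget.headroom_s(30))    # -15
```

A negative headroom is the case where the pod gets SIGKILL before it has finished shutting down.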

Readiness and liveness — different checks with different meanings

Kubernetes checks the pod's state through probes. It's important to understand the difference:

  • readinessProbe — "is the pod ready to accept traffic?" On failure the pod is removed from endpoints but is not restarted.
  • livenessProbe — "is the pod alive at all?" On failure the pod is restarted.

This is critically important for graceful shutdown. On shutdown the pod must return 503 to readiness requests — so that Kubernetes removes it from traffic. But liveness must keep responding 200, otherwise Kubernetes will decide the pod has hung and restart it — and no correct shutdown will happen.

That's why you need two different endpoints:

from contextlib import asynccontextmanager
from fastapi import FastAPI
from starlette.responses import JSONResponse


class AppState:
    def __init__(self) -> None:
        self.ready: bool = False


app_state = AppState()


@asynccontextmanager
async def lifespan(app: FastAPI):
    app_state.ready = True
    yield
    app_state.ready = False  # signal for readiness — stop accepting traffic


app = FastAPI(lifespan=lifespan)


@app.get("/health/live")
async def liveness():
    return {"status": "alive"}  # always 200, even during shutdown


@app.get("/health/ready")
async def readiness():
    if not app_state.ready:
        return JSONResponse(status_code=503, content={"status": "not_ready"})
    return {"status": "ready"}

And the corresponding probe configuration in the Deployment:

spec:
  containers:
    - name: app
      readinessProbe:
        httpGet:
          path: /health/ready
          port: 8080
        periodSeconds: 5
        timeoutSeconds: 2
        failureThreshold: 2
      livenessProbe:
        httpGet:
          path: /health/live
          port: 8080
        periodSeconds: 10
        timeoutSeconds: 2
        failureThreshold: 3
        initialDelaySeconds: 30

initialDelaySeconds: 30 on liveness is headroom for startup time. Python starts quickly, but if initializing the SQLAlchemy pool or connecting to Kafka takes time, then without a delay liveness might fire prematurely and restart a pod that hasn't come up yet.
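
To see why the two probes need separate endpoints, here is a small model of how consecutive failures are counted. It is a simplification written for this article, not kubelet code: ProbeTracker and simulate are hypothetical names, and the thresholds are the ones from the configuration above.

```python
from dataclasses import dataclass


@dataclass
class ProbeTracker:
    """Counts consecutive probe failures, as a probe with failureThreshold does."""
    failure_threshold: int
    consecutive_failures: int = 0

    def record(self, status_code: int) -> bool:
        """Record one probe result; return True once the threshold is reached."""
        if 200 <= status_code < 400:   # httpGet probes treat 2xx and 3xx as success
            self.consecutive_failures = 0
        else:
            self.consecutive_failures += 1
        return self.consecutive_failures >= self.failure_threshold


def simulate(readiness_codes, liveness_codes):
    """Return the actions triggered by two sequences of probe status codes."""
    readiness = ProbeTracker(failure_threshold=2)
    liveness = ProbeTracker(failure_threshold=3)
    actions = []
    for code in readiness_codes:
        if readiness.record(code):
            actions.append("removed from endpoints")
            break
    for code in liveness_codes:
        if liveness.record(code):
            actions.append("container restarted")
            break
    return actions


# Separate endpoints: readiness fails, liveness keeps answering 200.
print(simulate([503, 503, 503], [200, 200, 200]))
# One shared endpoint: the same 503s also trip the liveness probe.
print(simulate([503, 503, 503], [503, 503, 503]))
```

The first call reports only the removal from endpoints; the second also reports a restart, which is the failure mode a shared /health endpoint creates.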

How it works during shutdown

  1. The lifespan shutdown sets app_state.ready = False.
  2. /health/ready starts returning 503.
  3. Kubernetes sees two failed responses in a row (failureThreshold: 2 × periodSeconds: 5 = ~10 seconds).
  4. The pod is removed from endpoints, new requests no longer arrive.
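  5. uvicorn waits for the current in-flight requests and finishes.

A note on ordering: uvicorn stops accepting new connections as soon as it handles SIGTERM and runs the lifespan shutdown only after in-flight requests have finished or the graceful timeout has expired. A probe that arrives in between is refused rather than answered with 503, and the kubelet counts a refused connection as a failed readiness check as well. A terminating pod is also already being removed from endpoints (step 2 of the termination sequence), so the readiness flag is an extra safeguard rather than the only mechanism.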
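
The "~10 seconds" in step 3 is an upper bound. A rough sketch of the range, ignoring probe timeouts (the function name readiness_detection_window is hypothetical):

```python
def readiness_detection_window(period_s: int, failure_threshold: int) -> tuple[int, int]:
    """Return (best, worst) seconds from the first failing state to 'not ready'.

    Best case: a probe fires right after the state change, so only
    failure_threshold - 1 further periods are needed.
    Worst case: the state changes just after a successful probe, so a full
    period passes before the first failure is observed.
    Probe timeouts are ignored in this sketch.
    """
    if period_s <= 0 or failure_threshold <= 0:
        raise ValueError("period_s and failure_threshold must be positive")
    best = (failure_threshold - 1) * period_s
    worst = failure_threshold * period_s
    return best, worst


print(readiness_detection_window(5, 2))   # (5, 10)
```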

A full lifespan with resource closing

In a real service the lifespan must also close every resource it opened (the database engine, Kafka clients, background tasks):

import asyncio
import logging
from contextlib import asynccontextmanager
from typing import AsyncGenerator

from aiokafka import AIOKafkaConsumer, AIOKafkaProducer
from fastapi import FastAPI
from sqlalchemy.ext.asyncio import AsyncEngine, create_async_engine
from starlette.responses import JSONResponse

logger = logging.getLogger(__name__)


class OrderServiceState:
    def __init__(self) -> None:
        self.ready: bool = False
        self.engine: AsyncEngine | None = None
        self.consumer: AIOKafkaConsumer | None = None
        self.producer: AIOKafkaProducer | None = None
        self._background_tasks: set[asyncio.Task] = set()


state = OrderServiceState()


@asynccontextmanager
async def lifespan(app: FastAPI) -> AsyncGenerator[None, None]:
    logger.info("order-service startup")
    state.engine = create_async_engine("postgresql+asyncpg://...")
    state.producer = AIOKafkaProducer(bootstrap_servers="kafka:9092")
    state.consumer = AIOKafkaConsumer("order.commands", bootstrap_servers="kafka:9092")
    await state.producer.start()
    await state.consumer.start()
    state.ready = True

    yield

    logger.info("order-service shutdown: SIGTERM received")
    state.ready = False  # first remove from traffic

    for task in list(state._background_tasks):
        task.cancel()
        try:
            await task
        except asyncio.CancelledError:
            pass

    await state.consumer.stop()
    await state.producer.stop()
    await state.engine.dispose()
    logger.info("order-service shutdown complete")


app = FastAPI(lifespan=lifespan)


@app.get("/health/live")
async def liveness():
    return {"status": "alive"}


@app.get("/health/ready")
async def readiness():
    if not state.ready:
        return JSONResponse(status_code=503, content={"status": "not_ready"})
    return {"status": "ready"}
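
The listing above cancels whatever is in state._background_tasks but does not show how tasks get there. Below is a minimal sketch of one way to track them using only asyncio. The class BackgroundTasks and its methods spawn and cancel_all are hypothetical helpers, not part of FastAPI or of the service above, and asyncio.sleep stands in for a real consumer loop.

```python
import asyncio


class BackgroundTasks:
    """Keeps strong references to tasks so they can be cancelled on shutdown."""

    def __init__(self) -> None:
        self._tasks: set[asyncio.Task] = set()

    def spawn(self, coro) -> asyncio.Task:
        task = asyncio.create_task(coro)
        self._tasks.add(task)
        # Drop the reference as soon as the task finishes on its own.
        task.add_done_callback(self._tasks.discard)
        return task

    async def cancel_all(self) -> int:
        """Cancel every unfinished task and wait for it; return how many were cancelled."""
        pending = [t for t in self._tasks if not t.done()]
        for task in pending:
            task.cancel()
        # return_exceptions=True collects CancelledError instead of raising it here.
        await asyncio.gather(*pending, return_exceptions=True)
        return len(pending)

    def __len__(self) -> int:
        return len(self._tasks)


async def main() -> tuple[int, int]:
    tasks = BackgroundTasks()
    tasks.spawn(asyncio.sleep(3600))   # stands in for a long-running consumer loop
    tasks.spawn(asyncio.sleep(3600))
    await asyncio.sleep(0)             # let the tasks start
    cancelled = await tasks.cancel_all()
    await asyncio.sleep(0)             # let any remaining done callbacks run
    return cancelled, len(tasks)


print(asyncio.run(main()))   # (2, 0)
```

Cancelling all tasks first and awaiting them together means one slow task does not delay the cancellation of the others.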

Rolling deploy without losing requests

By default a Deployment rolls out with maxSurge: 25% and maxUnavailable: 25%, so with enough replicas Kubernetes may take an old pod down before the new one is ready to accept traffic. To avoid this, set the update strategy explicitly:

spec:
  replicas: 3
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0

  • maxUnavailable: 0 means no old pod is taken down until a new one is ready.
  • maxSurge: 1 means one extra pod can be temporarily created above the specified replica count.

The update order with this configuration:

Start: 3 pods of version v1
→ 1 v2 pod is created (4 pods total)
→ The v2 pod passes readinessProbe → enters traffic
→ One of the v1 pods begins shutting down (preStop → SIGTERM → graceful shutdown)
→ The next v2 is created
→ ...

Without maxUnavailable: 0, Kubernetes may kill v1 before v2 appears, and at peak traffic there will be fewer replicas than needed — 503s are possible.
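
The limits a strategy imposes can be worked out by hand or with a short sketch like the one below. The function names are hypothetical; the rounding follows the Deployment rules (a percentage for maxSurge rounds up, for maxUnavailable it rounds down).

```python
def resolve(value, replicas: int, round_up: bool) -> int:
    """Turn an absolute number or a percentage string like '25%' into a pod count."""
    if isinstance(value, str) and value.endswith("%"):
        scaled = replicas * int(value[:-1])
        # Integer arithmetic avoids floating-point surprises when rounding.
        return -(-scaled // 100) if round_up else scaled // 100
    return int(value)


def rollout_bounds(replicas: int, max_surge, max_unavailable) -> tuple[int, int]:
    """Return (minimum available pods, maximum total pods) during a rolling update."""
    surge = resolve(max_surge, replicas, round_up=True)               # maxSurge rounds up
    unavailable = resolve(max_unavailable, replicas, round_up=False)  # maxUnavailable rounds down
    if surge == 0 and unavailable == 0:
        raise ValueError("maxSurge and maxUnavailable cannot both be 0")
    return replicas - unavailable, replicas + surge


print(rollout_bounds(3, 1, 0))            # (3, 4)
print(rollout_bounds(10, "25%", "25%"))   # (8, 13)
```

With the explicit 1/0 setting the service never drops below 3 available pods; with the percentage defaults and 10 replicas it may run on 8 during the rollout.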

uvicorn: an explicit shutdown timeout

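uvicorn should start with an explicit graceful shutdown timeout. Without it the timeout is unset and uvicorn waits for open connections with no upper bound, so a single stuck request can hold the process until Kubernetes sends SIGKILL, and the lifespan shutdown never runs: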

CMD ["uvicorn", "app.main:app", \
     "--host", "0.0.0.0", \
     "--port", "8080", \
     "--workers", "1", \
     "--timeout-graceful-shutdown", "25"]

Or via Python:

import uvicorn

if __name__ == "__main__":
    uvicorn.run(
        "app.main:app",
        host="0.0.0.0",
        port=8080,
        workers=1,
        timeout_graceful_shutdown=25,
    )

25 seconds is a value that fits into the budget with terminationGracePeriodSeconds: 60 and preStop: sleep 10 (50 seconds are left for uvicorn + resources).
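
What a graceful shutdown timeout means can be shown with plain asyncio. The sketch below is a model of the idea (wait for running work up to a limit, then cancel the rest), not uvicorn's own code; drain is a hypothetical name and the sleeps stand in for requests.

```python
import asyncio


async def drain(tasks: set[asyncio.Task], timeout_s: float) -> tuple[int, int]:
    """Wait up to timeout_s for tasks to finish, then cancel the rest.

    Returns (finished, cancelled).
    """
    if not tasks:
        return 0, 0
    done, pending = await asyncio.wait(tasks, timeout=timeout_s)
    for task in pending:
        task.cancel()
    # Let cancelled tasks unwind before reporting.
    await asyncio.gather(*pending, return_exceptions=True)
    return len(done), len(pending)


async def main() -> tuple[int, int]:
    fast = asyncio.create_task(asyncio.sleep(0.01))   # request that finishes in time
    slow = asyncio.create_task(asyncio.sleep(10))     # request that exceeds the timeout
    return await drain({fast, slow}, timeout_s=0.2)


print(asyncio.run(main()))   # (1, 1)
```

The timeout trades the slowest requests for a predictable exit time, which is what lets the whole shutdown fit into terminationGracePeriodSeconds.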

Common mistakes

No preStop: kube-proxy has not yet updated the routes, so new requests go to the dying pod and fail (typically as connection errors, or as a 502 from the ingress in front of it).

terminationGracePeriodSeconds: 30 (the default value) — with preStop 10, uvicorn is left with 20 seconds. Long requests or slow resource closing → SIGKILL in the middle.

One /health for both probes — on shutdown we return 503, Kubernetes interprets it as "the pod is dead" and restarts it, breaking graceful shutdown.

Liveness checks the database connection — if the database is unavailable, liveness fails and Kubernetes restarts the pod. Liveness must always respond 200 while the process is alive; it doesn't check external dependencies.

timeout_graceful_shutdown not set in uvicorn: there is no upper bound on waiting for current requests, so one stuck request holds the process until SIGKILL and resources are never closed cleanly.
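
Some of these mistakes can be caught mechanically. The following is a rough sketch of such a check over a pod spec given as a plain dict; find_shutdown_issues is a hypothetical helper, it inspects only the first container, and it covers just three of the items above.

```python
def find_shutdown_issues(pod_spec: dict) -> list[str]:
    """Check a pod spec (as a plain dict) for some of the mistakes listed above."""
    issues = []
    container = pod_spec["containers"][0]   # sketch: only the first container is checked

    if "preStop" not in container.get("lifecycle", {}):
        issues.append("no preStop hook")

    # Kubernetes defaults terminationGracePeriodSeconds to 30 when it is omitted.
    if pod_spec.get("terminationGracePeriodSeconds", 30) <= 30:
        issues.append("terminationGracePeriodSeconds is 30 or less")

    readiness = container.get("readinessProbe", {}).get("httpGet", {}).get("path")
    liveness = container.get("livenessProbe", {}).get("httpGet", {}).get("path")
    if readiness is not None and readiness == liveness:
        issues.append("readiness and liveness share one endpoint")

    return issues


good = {
    "terminationGracePeriodSeconds": 60,
    "containers": [{
        "lifecycle": {"preStop": {"exec": {"command": ["sh", "-c", "sleep 10"]}}},
        "readinessProbe": {"httpGet": {"path": "/health/ready", "port": 8080}},
        "livenessProbe": {"httpGet": {"path": "/health/live", "port": 8080}},
    }],
}
bad = {
    "containers": [{
        "readinessProbe": {"httpGet": {"path": "/health", "port": 8080}},
        "livenessProbe": {"httpGet": {"path": "/health", "port": 8080}},
    }],
}

print(find_shutdown_issues(good))   # []
print(find_shutdown_issues(bad))    # all three issues
```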

In short

  • preStop: sleep 10 is mandatory — it gives kube-proxy time to remove the pod from the routes before SIGTERM.
  • terminationGracePeriodSeconds: 60 — the default of 30 isn't enough given preStop and uvicorn graceful.
  • Readiness and liveness — different endpoints with different behaviour on shutdown: readiness returns 503, liveness stays 200.
  • Liveness doesn't check external dependencies — only whether the process is alive.
  • maxSurge: 1, maxUnavailable: 0 — the new pod enters traffic before the old one shuts down.
  • --timeout-graceful-shutdown 25 is set explicitly in uvicorn.
  • app_state.ready = False is set first in the lifespan shutdown — removes the pod from traffic before closing resources.