When two services talk inside one process, a method call either returns a result or throws an exception — there is no third option. The moment a network appears between them, a third outcome shows up, the nastiest one: you don't know what happened. The request might not have arrived, might have arrived and been lost on the way back, or the server might have honestly done everything but the response got stuck in transit. A network is not "a slow method call" — it's a fundamentally different story, and a reliable backend is built on accepting that fact.
Let's go step by step through why the network is unreliable, why every call needs a timeout, how to retry requests safely, and what to do when a neighbour is clearly down. None of this is abstraction — it's a set of habits without which a service under load will eventually "hang" without a single error in the logs.
The network is unreliable by nature
Beginners carry a quiet assumption: "I called a service, so it will answer." Whole systems break on it. Back in the 90s engineers wrote down the "fallacies of distributed computing" — things that seem obviously true but are actually false. The first three matter most:
- The network is reliable. No. Cables get yanked, switches reboot, Wi-Fi flickers, packets drop. Any call may fail to arrive.
- Latency is zero. No. A response from the next rack takes milliseconds; from another data center, tens or hundreds of milliseconds. It isn't free.
- Bandwidth is infinite. No. The channel is not made of rubber; a big response or a traffic spike will hit a ceiling.
An analogy is a phone call versus a conversation in one room. In the room you're heard for sure. On the phone the line can drop mid-sentence, and you don't know whether the other side heard your last words or not. A distributed system lives exactly in "phone call" mode, and the code must be ready for it. More on the channels themselves is in the articles on TCP and UDP and network connections.
A separate hard case is a network partition: the link between parts of the system is gone, but each part is alive and thinks the other side died. This isn't fully "solved" — you prepare for it.
A timeout on every network call
The main rule of reliability sounds boring but saves more systems than anything else: every network call must have a timeout. Always. By default many HTTP clients wait for a response forever — and that's a trap.
Picture this: your service calls a neighbour, and the neighbour hangs. Without a timeout, the thread serving the request stalls and waits. A second such request comes in, and a second thread stalls. Soon the thread pool and the connection pool run dry, and your entire service stops responding even though it is healthy: all its resources are tied up waiting on a dead neighbour. This is exactly what "it hung" means: the service did not fail with an error, it quietly stopped answering.
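The exhaustion scenario can be reproduced in miniature. The sketch below is only an illustration under stated assumptions: the "neighbour" is simulated with a threading.Event that is never set while it hangs, and the names demo_pool_exhaustion, call_neighbour and health_check are hypothetical. It shows a request that never touches the neighbour being starved anyway, because every worker is stuck waiting.

```python
import threading
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout


def demo_pool_exhaustion(pool_size=2, wait=0.2):
    """Show a healthy request starving behind calls to a hung neighbour."""
    neighbour_recovered = threading.Event()  # stays unset while the neighbour hangs

    def call_neighbour():
        neighbour_recovered.wait()  # no timeout: blocks for as long as the neighbour hangs
        return "neighbour ok"

    def health_check():
        return "healthy"  # needs no neighbour at all

    with ThreadPoolExecutor(max_workers=pool_size) as pool:
        # Fill every worker with a call that is stuck on the neighbour.
        stuck = [pool.submit(call_neighbour) for _ in range(pool_size)]
        health = pool.submit(health_check)  # queued behind the stuck calls
        try:
            health.result(timeout=wait)
            starved = False
        except FutureTimeout:
            starved = True  # no free worker picked it up in time
        neighbour_recovered.set()  # the neighbour comes back, workers are released
        return starved, health.result(), [f.result() for f in stuck]
```

With the default arguments the health check does not complete within the wait, although nothing is wrong with the service itself; it only runs once the simulated neighbour recovers.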
A timeout turns endless waiting into an honest, fast error you can handle. It helps to distinguish two timeouts:
- Connect timeout — how long we wait for the connection to be established (usually short, hundreds of milliseconds).
- Read/response timeout — how long we wait for a response once the connection is up.
Values are chosen from the dependency's real behaviour, not "make it big so it never fires". A timeout set to 60 seconds "just in case" doesn't protect anything — it just postpones the disaster.
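The two timeouts can be set separately even with the standard library. This is a minimal sketch, not a production client: request_with_timeouts is a hypothetical helper, the default values 0.3 and 1.0 seconds are illustrative assumptions, and a single recv stands in for reading a full response.

```python
import socket


def request_with_timeouts(host, port, payload, connect_timeout=0.3, read_timeout=1.0):
    """Send payload and read a reply, with separate connect and read timeouts."""
    # Connect timeout: bounds how long establishing the connection may take.
    sock = socket.create_connection((host, port), timeout=connect_timeout)
    try:
        # Read timeout: bounds each blocking send/recv on the open connection.
        sock.settimeout(read_timeout)
        sock.sendall(payload)
        return sock.recv(4096)  # raises socket.timeout if no bytes arrive in time
    finally:
        sock.close()
```

If the peer accepts the connection but never answers, the call fails with socket.timeout after roughly read_timeout seconds instead of blocking forever, and the caller can handle that as an ordinary error.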
Retries — only for idempotent operations
Since a call may fail to arrive, retrying it is the obvious move. Retries are a powerful tool, but they have a sharp edge.
You may only retry idempotent operations. An operation is idempotent if performing it several times leaves the system in the same state as performing it once. Reading an order (GET) can be done ten times and nothing changes. But "charge the money" cannot be retried blindly: if the first request actually arrived and the money was charged but the response got lost, the retry will charge a second time.
The second sharp edge is a retry storm. The neighbour slows down, all clients hit a timeout together and all retry together — load doubles. The neighbour goes down completely, clients retry even more aggressively, and the retries finish off the already-fallen service, giving it no chance to recover. To avoid this, you retry not immediately and not head-on, but with exponential backoff (a growing pause) and jitter (random spread so clients don't strike in unison):
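import random
import time
MAX_ATTEMPTS = 4  # illustrative value
MIN_DELAY = 0.1   # seconds, illustrative value

class RetriableError(Exception):
    """A transient failure worth retrying (timeout, 503)."""

def call_with_retries(call):
    attempt = 0
    while attempt < MAX_ATTEMPTS:
        try:
            return call()  # one network call with a timeout
        except RetriableError:
            attempt += 1
            if attempt == MAX_ATTEMPTS:
                raise  # give up, pass the error upward
            base = MIN_DELAY * (2 ** attempt)  # 0.2s, 0.4s, 0.8s, ...
            time.sleep(base + random.uniform(0, base))  # backoff + jitter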
Three rules of a sane retry: a finite number of attempts, a growing pause with jitter, and retrying only errors worth retrying (timeout, 503 — yes; 400 or 404 — no, the second time will be the same).
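The third rule, retrying only what is worth retrying, fits in a small predicate. This is a sketch under assumptions: is_retriable and RETRIABLE_STATUSES are hypothetical names, and treating 502 and 504 as transient alongside 503 is a choice made here, not something the rule above prescribes.

```python
# Assumption: gateway and availability errors are treated as transient.
RETRIABLE_STATUSES = {502, 503, 504}


def is_retriable(status=None, timed_out=False):
    """Decide whether a failed call is worth another attempt."""
    if timed_out:
        return True  # no answer at all: the failure may be transient
    # Client errors such as 400 or 404 will fail the same way again.
    return status in RETRIABLE_STATUSES
```

A timeout only says that no answer arrived, not that nothing happened on the other side, so even a "retriable" verdict is safe to act on only for idempotent operations.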
Idempotency and idempotency keys
Idempotent operations are easy — you can just repeat them. The problem is with the ones that change state and "shouldn't" be repeated: a payment, creating an order, sending an email. How do you safely retry a money transfer when it's unclear whether the first request arrived or not?
The answer is an idempotency key. The client generates a unique identifier for the operation (a UUID, say) and attaches it to the request, usually in a header. The server remembers processed keys. A request comes in with a new key — we execute it and record the result. A repeat comes in with the same key — we don't execute again, we return the saved result from the first time.
An analogy is a coat-check tag. You hand over your coat, get tag #17. Come back with tag #17 again — you won't get a second coat, you'll get the same one back. An idempotency key makes "charge the money" safe to retry: no matter how many times the client repeats the request with one key, the money is charged once.
This is exactly how payment APIs work: the client sends an Idempotency-Key, and resending the same payment — on a drop, a timeout, a retry — doesn't create a second payment.
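The server side of this idea can be sketched with an in-memory dictionary. Everything here is a hypothetical illustration: PaymentService, its charge method and the account name are made up, and a real service would keep the keys in durable storage and make "check the key, then record the result" atomic.

```python
import uuid


class PaymentService:
    """Toy server that remembers processed idempotency keys."""

    def __init__(self):
        self.results = {}  # idempotency key -> saved result of the first execution
        self.charges = []  # every real charge, kept only for illustration

    def charge(self, idempotency_key, account, amount):
        if idempotency_key in self.results:
            # Repeat with a known key: return the saved result, charge nothing.
            return self.results[idempotency_key]
        self.charges.append((account, amount))  # the real side effect happens once
        result = {"status": "charged", "account": account, "amount": amount}
        self.results[idempotency_key] = result
        return result


service = PaymentService()
key = str(uuid.uuid4())  # generated once per operation, reused on every retry
first = service.charge(key, "acc-1", 100)
retry = service.charge(key, "acc-1", 100)  # e.g. the first response was lost
```

Both calls return the same result and only one charge is recorded. The important detail on the client side is that the key is created once per logical operation, not once per attempt.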
Circuit breaker: stop hammering the fallen
Retries with backoff help with short hiccups. But if a neighbour is down solidly — a minute, five — continuing to knock on it is pointless and harmful: every call runs into the timeout, ties up a thread, and slows your service down for a request that's doomed anyway.
Here a circuit breaker helps — a "fuse", by analogy with the electrical one. It has three states:
- Closed — all is well, requests pass straight through. The breaker counts errors.
- Open — there have been too many errors; the breaker opens and for a while immediately rejects requests to this dependency without spending a timeout. Your service quickly returns an error or a fallback instead of hanging.
- Half-open — after a pause the breaker lets a probe request through. It succeeds — the breaker closes and traffic resumes; it fails again — the breaker opens once more.
The point is to give the fallen neighbour a breather and not turn its failure into your own. A circuit breaker and sensible retries are two sides of one coin and usually work as a pair.
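The three states can be sketched as a small class. This is a single-threaded illustration, not a reference implementation: CircuitBreaker, CircuitOpenError, failure_threshold and reset_timeout are assumed names, failures are counted consecutively, and a production breaker would also need locking and a limit on concurrent probes.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without trying the dependency."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=5.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold  # consecutive failures before opening
        self.reset_timeout = reset_timeout          # seconds to stay open before a probe
        self.clock = clock                          # injectable so tests can control time
        self.failures = 0
        self.state = "closed"
        self.opened_at = 0.0

    def call(self, func):
        if self.state == "open":
            if self.clock() - self.opened_at < self.reset_timeout:
                # Fail fast: no thread and no timeout spent on a doomed request.
                raise CircuitOpenError("dependency is marked as down")
            self.state = "half-open"  # the pause is over: let one probe through
        try:
            result = func()
        except Exception:
            self.failures += 1
            if self.state == "half-open" or self.failures >= self.failure_threshold:
                self.state = "open"
                self.opened_at = self.clock()
            raise
        # Success: reset the counter and resume normal traffic.
        self.failures = 0
        self.state = "closed"
        return result
```

The caller wraps each call to the dependency as breaker.call(lambda: ...) and treats CircuitOpenError as the signal to return a fast error or a fallback.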
A timeout budget across the call chain
The last piece of the puzzle shows up when calls line up in a chain: an API gateway calls the orders service, which calls the payments service, which calls the bank. Each has its own timeout, and they cannot be set independently.
If the outer call has a 2-second timeout while an inner service in the chain waits 5 seconds for its own neighbour, then the client outside has already given up and left, while inside the chain is still labouring over a response nobody needs, and holding resources. The remedy is a timeout budget: the whole operation has a shared time limit, and each call deeper in gets the remainder of it, not its own value pulled from the air.
The practical rule: the deeper a call is in the chain, the shorter its timeout. The outer layer is the most patient, the inner ones less and less, so that by the time the client loses hope, the whole chain has already wound down cleanly. Retries inside the chain also eat the budget: three attempts of a second each are already three seconds that must fit within the shared limit.
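One way to carry a budget through a chain is a deadline object created at the entry point and passed down with the request. The sketch below is an assumption-laden illustration: Deadline, DeadlineExceeded and timeout_for are hypothetical names, and how the remaining time crosses a process boundary (a header, for example) is left out.

```python
import time


class DeadlineExceeded(Exception):
    """The shared budget is spent; calling further would be wasted work."""


class Deadline:
    def __init__(self, budget_seconds, clock=time.monotonic):
        self.clock = clock  # injectable so tests can control time
        self.expires_at = clock() + budget_seconds

    def remaining(self):
        """Seconds left in the shared budget, never negative."""
        return max(0.0, self.expires_at - self.clock())

    def timeout_for(self, cap):
        """Timeout for one call: its own cap, but never more than what is left."""
        left = self.remaining()
        if left <= 0:
            raise DeadlineExceeded("no budget left; fail fast instead of calling")
        return min(cap, left)
```

With a 2-second budget, an inner call that would normally allow itself 5 seconds gets at most 2, and after 1.5 seconds have passed any further call, including a retry, gets at most 0.5. Once the budget is gone the chain stops instead of working on a response nobody is waiting for.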
Where this applies
Everything above isn't theory — it's the daily toolkit of a backend developer. As soon as a service goes over the network (and it always does — to the database, the queue, a neighbouring service, an external API), these habits decide whether it survives someone else's failure or goes down with it. The engineering rules — timeouts, retry, circuit breaker, bulkhead — are collected in the resilience standard; to even see that a call is lagging or a breaker has opened, you need observability.
Where beginners stumble:
- They forget the timeout and wonder why a healthy service "hung". The client often waits forever by default — that's the first thing to check.
- They retry non-idempotent operations without an idempotency key — and get double payments and duplicate orders.
- They retry head-on, without backoff and jitter, giving a fallen neighbour a retry storm instead of a breather.
- They set one huge timeout "so it's definitely enough" — that's not protection but postponed degradation: the resources still hang, just longer.
- They retry 400/404 errors, which on a repeat yield exactly the same thing. Only transient failures are worth retrying.
What to learn next
Next it helps to look at which channels all this plays out on: TCP and UDP — what guarantees delivery and what doesn't, and how a timeout behaves on different transports; network connections — why establishing a connection isn't free and how a connection pool relates to timeouts. From the application-layer side — HTTP: which status codes are worth retrying and which aren't, and how headers carry the idempotency key. And when there are many calls to spread around — load balancers, themselves part of the reliability picture. The engineering specifics for all of this are held by the resilience standard.