A single service instance eventually hits a ceiling: it handles so many requests per second and no more. And even if it copes, it will one day restart or crash, and then the service is down entirely. That's why a real system always runs several instances: three, ten, a hundred identical copies of one service. But the client has a single address it knocks on. Someone has to stand at the entrance and decide which copy each request goes to. That someone is the load balancer.
Let's look at how it works, how it differs from a reverse proxy, why people talk about L4 and L7, how it figures out that one of the copies has died, and what your backend service needs to handle because of all this.
Why a load balancer
Picture the entrance to a big store with ten checkout lanes. If every shopper picked a lane themselves, one queue would be twenty people deep while the lane next to it sat idle. So they put in an attendant who directs people: "you go to lane three, you to lane seven". They see all the lanes at once and spread the flow evenly.
A load balancer is exactly that attendant for network requests. Clients (browsers, mobile apps, neighbouring services) reach out to a single address, and the balancer hides a pool of identical instances behind it and spreads the requests among them. That buys you three things at once: scaling (add instances, get more throughput), fault tolerance (one copy dies, traffic goes to the live ones), and seamless updates (roll out a new version one instance at a time without dropping the service).
Reverse proxy versus load balancer
These two terms are often used almost interchangeably, and in practice they're frequently the same program. The difference is one of emphasis.
A reverse proxy is an intermediary that sits in front of services, accepts requests on the client's behalf, and forwards them inside. It's "reverse" because it works on the server side, unlike an ordinary (forward) proxy that sits on the client side. A reverse proxy can do a lot besides distribution: terminate encryption, serve a cache, compress responses, route requests to different services by address.
A load balancer is the function "spread requests across several identical instances". Usually the same reverse proxy performs it. Take the popular nginx: it's a reverse proxy and a load balancer at once — one program accepts traffic, distributes it across backends, and handles caching and encryption along the way.
The simplest way to hold this in your head: a reverse proxy is the role "intermediary at the entrance", and balancing is one of its jobs. In the cloud the balancer is often a separate managed service, but the idea is the same.
L4 versus L7
Remember the network layers from the OSI and TCP/IP models? Load balancers come in two types along exactly those layers — and it's a fundamental distinction.
An L4 balancer works at the transport layer. It operates on TCP connections and doesn't look inside: it sees only IP addresses and ports, but doesn't know whether HTTP or something else is in there. Its job is to take an incoming connection and hand the whole of it to one of the backends. Fast and cheap, because there's no content to parse. But it can't do much either: spread connections, and that's it.
An L7 balancer works at the application layer and understands HTTP. It reads the request: the method, the path /orders/42, headers, cookies. Thanks to that it can do what L4 can't:
- route by content: send requests for /api/ to one pool and /images/ to another;
- terminate TLS: decrypt HTTPS at the entrance so backends receive plain HTTP and don't spend effort on encryption;
- balance by individual requests rather than by connections: a single keep-alive connection may carry a hundred requests, and L7 spreads them across different backends.
The price of L7's flexibility is that it has to parse every request — a bit more expensive. In practice web services almost always use L7: it's smarter and speaks the language of HTTP.
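To make content-based routing concrete, here is a minimal sketch of the decision an L7 balancer makes once it has parsed the request path. The routing table, the backend addresses, and the name pick_pool are all made up for illustration; real balancers express the same idea in their own configuration.

```python
# Hypothetical routing table: path prefix -> backend pool (addresses are invented).
ROUTES = {
    "/api/": ["10.0.1.10:8080", "10.0.1.11:8080"],
    "/images/": ["10.0.2.10:8080"],
}
DEFAULT_POOL = ["10.0.3.10:8080"]


def pick_pool(path: str) -> list[str]:
    """Return the pool whose prefix matches the path; the longest prefix wins."""
    matches = [prefix for prefix in ROUTES if path.startswith(prefix)]
    if not matches:
        return DEFAULT_POOL
    return ROUTES[max(matches, key=len)]
```

An L4 balancer cannot make this choice at all: the path only becomes visible after the HTTP request has been read.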
Distribution algorithms
How does the balancer choose which instance to send the next request to? There are a few simple strategies:
- Round-robin — in turn: first request to backend #1, second to #2, third to #3, then back to #1. Simple and fair when all instances are equally powerful and requests cost roughly the same.
- Least-connections — the request goes where there are the fewest active connections right now. Good when requests differ: one answers instantly, another hangs for a minute. Round-robin might overload a busy instance, while least-connections accounts for load.
- By hash — the balancer takes some feature of the request (say, the client's IP or a value from the URL), hashes it, and always sends it to the same instance. This way one client reliably lands on one backend — which is what sticky sessions, below, need.
There are variations (weighted round-robin, where more powerful instances get more traffic), but the idea is the same: spread the flow sensibly.
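The three strategies are small enough to sketch in a few lines each. The class names and the pick/release interface below are assumptions made for this example, not the API of any particular balancer.

```python
import hashlib
import itertools


class RoundRobin:
    """Hands out backends in a fixed repeating order."""

    def __init__(self, backends):
        self._cycle = itertools.cycle(backends)

    def pick(self, request_key=None):
        return next(self._cycle)


class LeastConnections:
    """Sends each request to the backend with the fewest active connections."""

    def __init__(self, backends):
        self.active = {backend: 0 for backend in backends}

    def pick(self, request_key=None):
        # On a tie, min() returns the first backend in insertion order.
        backend = min(self.active, key=self.active.get)
        self.active[backend] += 1
        return backend

    def release(self, backend):
        # Call when the request finishes so the counter reflects real load.
        self.active[backend] -= 1


class HashBased:
    """Maps the same key (client IP, URL value) to the same backend."""

    def __init__(self, backends):
        self.backends = list(backends)

    def pick(self, request_key):
        digest = hashlib.sha256(request_key.encode()).digest()
        index = int.from_bytes(digest[:8], "big") % len(self.backends)
        return self.backends[index]
```

With RoundRobin(["b1", "b2", "b3"]), four calls to pick() return b1, b2, b3, b1. Note that the hash variant here uses plain modulo, so a key only stays on its backend while the pool keeps the same size; adding or removing an instance reshuffles most keys, which is the problem consistent hashing is usually brought in to soften.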
Health checks: weeding out a dead instance
The balancer must know which instances are alive, otherwise it will keep stubbornly sending requests to a crashed one and clients will get errors. For that it regularly runs a health check. Usually that's a dedicated HTTP endpoint like /health that the balancer pings every few seconds.
As long as the instance answers 200 OK, it stays in rotation and receives traffic. The moment it stops answering or starts returning errors several times in a row, the balancer takes it out of rotation and stops sending requests. When the instance starts answering again, it's brought back. This way the pool heals itself: a failed copy is simply excluded, and the client keeps getting responses from the live ones. It's one of the foundations of a distributed service's reliability.
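The "several times in a row" logic can be sketched as a small state machine that the balancer keeps per backend. The thresholds (three failures to eject, two successes to return) and the name HealthTracker are assumptions for this example; real balancers make such values configurable.

```python
class HealthTracker:
    """Tracks probe results and decides whether a backend stays in rotation."""

    def __init__(self, fail_threshold=3, rise_threshold=2):
        self.fail_threshold = fail_threshold   # assumed: failures in a row to eject
        self.rise_threshold = rise_threshold   # assumed: successes in a row to return
        self.in_rotation = True
        self._fails = 0
        self._successes = 0

    def record(self, status_code):
        """Feed one probe result; None stands for a timeout or no answer."""
        ok = status_code is not None and 200 <= status_code < 300
        if ok:
            self._fails = 0
            self._successes += 1
            if not self.in_rotation and self._successes >= self.rise_threshold:
                self.in_rotation = True
        else:
            self._successes = 0
            self._fails += 1
            if self.in_rotation and self._fails >= self.fail_threshold:
                self.in_rotation = False
        return self.in_rotation
```

Requiring a run of failures rather than a single one keeps one slow probe from ejecting a healthy instance, and requiring a run of successes keeps a flapping instance from bouncing in and out of the pool.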
Sticky sessions and why stateless is better
Sometimes it's tempting to pin a client to one instance so all its requests go to the same backend. That's called a sticky session. Why? If the service keeps user state in its own memory — say, the contents of a cart or a logged-in session's data — then that client must return to where the data lives. Otherwise another instance "doesn't recognize" them.
The problem is that sticky sessions undo much of what balancing gives you. If the instance that clients are pinned to crashes, all their sessions vanish along with its memory. Load spreads unevenly, because pinned clients can't be moved to a less busy instance. Rolling out an update smoothly gets harder too.
So the right path is to make the service stateless: don't keep state in an instance's memory, push it outside — into a database, into Redis, into a token on the client's side. Then any request can go to any instance, they're all interchangeable, sticky sessions aren't needed, and the balancer freely spreads traffic however it likes. It's one of the key properties of services that scale well.
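One of the options above, "a token on the client's side", can be sketched with nothing but the standard library. This is an illustration under loud assumptions: every instance shares the same SECRET, the token is signed but not encrypted (the client can read it), and there is no expiry. The names issue_token and read_token are invented here; a production service would normally use an established token format and library instead.

```python
import base64
import hashlib
import hmac
import json

# Assumption: the same secret is configured on every instance.
SECRET = b"shared-secret-for-illustration-only"


def issue_token(state: dict) -> str:
    """Serialize the state and append an HMAC signature."""
    payload = base64.urlsafe_b64encode(json.dumps(state, sort_keys=True).encode())
    signature = hmac.new(SECRET, payload, hashlib.sha256).hexdigest()
    return payload.decode() + "." + signature


def read_token(token: str):
    """Return the state dict, or None if the signature does not match."""
    payload, _, signature = token.rpartition(".")
    expected = hmac.new(SECRET, payload.encode(), hashlib.sha256).hexdigest()
    if not hmac.compare_digest(signature.encode(), expected.encode()):
        return None
    return json.loads(base64.urlsafe_b64decode(payload))
```

Because the state travels with the request and any instance can verify it, it no longer matters which backend the balancer picks.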
Where this applies
A load balancer is an invisible but mandatory part of almost any production service. The moment there's more than one instance, someone has to spread traffic between them. For a backend developer this turns into a few practical requirements for their own service.
First, the service must be stateless. Don't hold user state in an instance's memory; then any request can go to any copy and the balancer's hands aren't tied. State belongs in a database or a shared cache.
Second, graceful shutdown matters. When an instance shuts down (during an update or scale-down), it should first take itself out of the balancer's rotation by making its health check report "I'm no longer ready", then wait for in-flight requests to finish, and only then stop. Otherwise clients whose requests were on their way to it at the moment of shutdown get dropped connections. This usually comes down to a combination of the health endpoint, keep-alive connection handling, and correct handling of the termination signal.
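The shutdown sequence can be sketched as a readiness flag plus a counter of in-flight requests. The Lifecycle class and its method names are hypothetical, and the sketch assumes the termination signal is SIGTERM and that the web framework calls request_started and request_finished around each request.

```python
import signal
import threading


class Lifecycle:
    """Readiness flag plus an in-flight request counter."""

    def __init__(self):
        self._lock = threading.Lock()
        self._idle = threading.Event()
        self._idle.set()       # nothing in flight yet
        self._in_flight = 0
        self.ready = True      # what the health endpoint reports

    def health_status(self):
        return 200 if self.ready else 503

    def request_started(self):
        with self._lock:
            self._in_flight += 1
            self._idle.clear()

    def request_finished(self):
        with self._lock:
            self._in_flight -= 1
            if self._in_flight == 0:
                self._idle.set()

    def begin_shutdown(self, *_signal_args):
        # Step 1: fail the health check so the balancer stops sending new traffic.
        self.ready = False

    def wait_for_drain(self, timeout=30.0):
        # Step 2: block until in-flight requests finish; False means the timeout hit.
        return self._idle.wait(timeout)


lifecycle = Lifecycle()
# Signal handlers can only be installed from the main thread.
if threading.current_thread() is threading.main_thread():
    signal.signal(signal.SIGTERM, lifecycle.begin_shutdown)
```

After the signal arrives, the main loop would call wait_for_drain() and exit only when it returns. In practice the instance also needs to keep serving for at least one health check interval after flipping the flag, since the balancer only notices the change on its next probe.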
Third, it helps to know where the balancer lives in your infrastructure. In Kubernetes the role of L4 balancing inside the cluster is played by a Service, while the entrance from outside, with L7 routing by path and TLS termination, is an Ingress. On AWS these are managed balancers in front of your instances — how they fit into the network is covered in the article on networking in AWS.
Where beginners stumble:
- They keep the session in an instance's memory and are surprised that the "user gets logged out". The stickiness has broken: the request went to another instance, which doesn't have the state. The fix is to move state outside.
- They forget the health endpoint or make it too shallow. If /health always answers 200 even when the database is down, the balancer will keep sending traffic to an instance that is in fact broken.
- They stop the service abruptly, without graceful shutdown. The instance disappears before the balancer has taken it out of rotation, and some requests get dropped.
- They confuse L4 and L7 and expect URL routing from an L4 balancer. It doesn't look inside the connection and knows nothing about URLs — that needs L7.
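The "too shallow" health endpoint from the list above is easy to avoid: have the handler actually probe what the service depends on. A minimal sketch, where health_response and the shape of checks (a dependency name mapped to a callable that raises on failure) are assumptions for this example:

```python
def health_response(checks):
    """Run each dependency probe and return (status_code, details).

    `checks` maps a dependency name to a zero-argument callable that
    raises an exception when the dependency is unavailable.
    """
    details = {}
    for name, probe in checks.items():
        try:
            probe()
            details[name] = "ok"
        except Exception as exc:
            details[name] = f"failed: {exc}"
    healthy = all(value == "ok" for value in details.values())
    return (200 if healthy else 503), details
```

A /health handler would pass in something like a cheap database ping and return the resulting status code. The probes should be fast and have short timeouts, because the balancer calls this endpoint every few seconds.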
What to learn next
A load balancer sits at the crossroads of several topics worth laying out side by side. Start with the OSI and TCP/IP models — then the split into L4 and L7 becomes obvious. Next, HTTPS and TLS, to understand what "terminating encryption at the balancer" means. Get comfortable with connections and pools: a balancer spreads exactly these connections and the requests inside them. And tie it all to reliability — health checks, taking instances out of rotation, and graceful shutdown are precisely about how a service survives the failure of individual parts.