Big Picture Overview
As your application grows, you face a fundamental problem: requests pour in from thousands of users simultaneously, and a single server—no matter how powerful—has limits. It can only accept a finite number of connections, process a finite number of requests per second, and store a finite amount of data in memory.
A load balancer is the component that solves this problem. It sits between your clients and your backend servers, acting as a traffic controller. When a request arrives, instead of going directly to one server, it goes to the load balancer, which then decides which backend server should handle it. This allows you to scale horizontally—adding more servers to handle more traffic—rather than just buying bigger and bigger machines.
The load balancer doesn’t just distribute traffic randomly. It’s intelligent: it monitors the health of each backend server, ensures no single server is overwhelmed, and can gracefully remove failed servers from the rotation. It’s foundational to building systems that are both scalable and resilient.
In the context of modern system architecture, the load balancer sits at a critical position in your infrastructure. It’s typically one of the first components clients interact with, and its decisions ripple through the entire system.
Core Concepts Explained Simply
What Is a Load Balancer?
A load balancer is a network device or software service that distributes incoming requests across multiple backend servers. Think of it like a receptionist at a busy office: instead of all visitors going to one person, the receptionist directs them to whoever is available.
The Two Main Types of Load Balancers
Load balancers operate at different layers of the network, each with different trade-offs:
Application Load Balancer (Layer 7 / HTTP(S))
An ALB understands HTTP and HTTPS protocols. It can read the content of HTTP requests—including headers, paths, hostnames, and even query parameters—to make routing decisions. For example, it can route requests to /api/* to one set of servers and requests to /images/* to another set optimized for serving static files.
- Advantage: Intelligent routing based on application data
- Disadvantage: Slightly higher latency because it must examine request content
Network Load Balancer (Layer 4 / TCP/UDP)
An NLB operates at the transport layer and only looks at IP addresses and port numbers. It doesn’t understand HTTP at all. It simply forwards TCP or UDP packets to a selected backend server.
- Advantage: Extremely fast with minimal latency; can handle millions of connections per second
- Disadvantage: No application-level intelligence; cannot make routing decisions based on HTTP headers or paths
Practical reality: For most web applications (HTTP/HTTPS), use an ALB. For high-throughput, low-latency scenarios (gaming, real-time messaging, video streaming), use an NLB.
Load Balancing Algorithms
The algorithm determines which backend server receives each request. Common algorithms include:
Round Robin
Requests are sent to servers in sequence: Server 1, Server 2, Server 3, Server 1, Server 2, etc. Simple and fair if all servers have equal capacity.
Weighted Round Robin
If servers have different capacities (one powerful machine and two smaller ones), you assign weights. A server with weight 2 gets twice as many requests as a server with weight 1.
Least Connections
The load balancer sends new requests to the server with the fewest active connections. Useful when requests have varying duration. A server handling 5 long-running requests should get fewer new requests than a server handling 20 short requests.
IP Hash (Source IP)
Requests from the same client IP always go to the same backend server. Ensures session affinity—useful if your backend servers maintain in-memory session state. However, it can create imbalance if traffic comes from a few large sources.
Resource-Based (Adaptive)
The load balancer periodically queries each server for its CPU, memory, and I/O metrics, then routes new requests to the healthiest servers. Requires agents running on backend servers but provides the most balanced distribution in heterogeneous environments.
Which to choose? For most cases, use Least Connections with weights. It handles real-world variability well.
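To make these concrete, here is a minimal, illustrative sketch of the selection logic in Python. The `Server` class and its fields are assumptions for the example, not part of any real load balancer's API:

```python
from dataclasses import dataclass

@dataclass
class Server:
    name: str
    weight: int = 1            # relative capacity
    active_connections: int = 0

def round_robin(servers, counter):
    # Cycle through servers in order; return the pick and the next counter.
    return servers[counter % len(servers)], counter + 1

def weighted_least_connections(servers):
    # Fewest connections per unit of weight: a weight-2 machine is
    # allowed roughly twice the load of a weight-1 machine.
    return min(servers, key=lambda s: s.active_connections / s.weight)

def ip_hash(servers, client_ip):
    # Same client IP maps to the same server (session affinity).
    # Use a stable hash (e.g. hashlib) in production; Python's hash()
    # is randomized per process.
    return servers[hash(client_ip) % len(servers)]

servers = [Server("s1", weight=2), Server("s2"), Server("s3")]
print(weighted_least_connections(servers).name)
```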
Health Checks and Failover
A load balancer must know which servers are alive and healthy. It does this by periodically sending health check probes to each backend server:
- An HTTP(S) load balancer might send a GET request to /health
- A TCP load balancer might attempt to establish a connection to a specific port
If a server fails consecutive health checks (e.g., 3 failures in a row), the load balancer stops routing new traffic to it. If it recovers (passes 3 consecutive checks), it’s brought back into rotation. This is automatic and requires no human intervention.
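A minimal sketch of that bookkeeping, assuming an HTTP /health endpoint and the 3-failure / 3-success thresholds mentioned above (the URL and thresholds are illustrative):

```python
import requests

FAIL_THRESHOLD = 3      # consecutive failures before removal from rotation
RECOVER_THRESHOLD = 3   # consecutive successes before re-adding

class BackendHealth:
    def __init__(self, base_url):
        self.base_url = base_url     # e.g. "http://10.0.0.5:8080"
        self.healthy = True
        self.fails = 0
        self.successes = 0

    def probe(self):
        try:
            ok = requests.get(self.base_url + "/health", timeout=2).status_code == 200
        except requests.RequestException:
            ok = False
        if ok:
            self.fails, self.successes = 0, self.successes + 1
            if not self.healthy and self.successes >= RECOVER_THRESHOLD:
                self.healthy = True      # bring back into rotation
        else:
            self.successes, self.fails = 0, self.fails + 1
            if self.healthy and self.fails >= FAIL_THRESHOLD:
                self.healthy = False     # stop routing new traffic here
```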
How Load Balancers Integrate into a System
┌─────────────────────────────────────────────────────────────┐
│ CLIENTS │
│ (Browsers, Mobile Apps, APIs, etc.) │
└─────────────────────────────────────────────────────────────┘
│
│ HTTP/HTTPS Request
│
▼
┌─────────────────────────────────────────────────────────────┐
│ LOAD BALANCER │
│ ┌───────────────────────────────────────────────────────┐ │
│ │ Selects backend server based on algorithm │ │
│ │ Health checks backend servers regularly │ │
│ │ May terminate TLS encryption (see below) │ │
│ │ Routes request to healthy server │ │
│ └───────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────┘
│
┌───────────┼───────────┐
│ │ │
▼ ▼ ▼
┌──────────────┐ ┌──────────────┐ ┌──────────────┐
│ Server 1 │ │ Server 2 │ │ Server 3 │
│ (Active) │ │ (Active) │ │ (Standby) │
│ │ │ │ │ │
│ Application │ │ Application │ │ Application │
│ Process │ │ Process │ │ Process │
└──────────────┘ └──────────────┘ └──────────────┘
│ │ │
└───────────┼────────────────────┘
│
▼
┌─────────────────────────┐
│ Backend Datastore │
│ (Database, Cache) │
└─────────────────────────┘
Request Flow
- Client initiates request: A user’s browser sends an HTTP request to your service’s domain.
- DNS resolves to load balancer: The domain name resolves to the load balancer’s IP address.
- Load balancer receives request: The load balancer accepts the connection and examines the request (in the case of ALB) or simply checks the TCP connection (in the case of NLB).
- Load balancer selects backend: Using its configured algorithm, it chooses a healthy backend server.
- Load balancer forwards request: It opens a new connection to the backend server and sends the request.
- Backend processes request: The server handles the request, queries the database if needed, and sends a response back to the load balancer.
- Load balancer returns response: The load balancer forwards the response back to the client.
Key Integration Points
With clients: The load balancer is the single IP address clients connect to. From the client’s perspective, there’s only one service.
With backend services: The load balancer maintains persistent information about each backend—its IP, port, health status, and current connection count.
With infrastructure: The load balancer itself must be redundant. If the load balancer fails, the entire system becomes unreachable. This is solved by having a pair of load balancers (primary and standby) that share a floating IP address. If the primary fails, the standby automatically takes over.
With external systems: For global deployments, you might use a global load balancer (or DNS-based routing) that sits in front of regional load balancers, directing users to the geographically closest datacenter.
TLS Termination
The Problem: Encrypted Traffic
When a client connects to your service using HTTPS, the data is encrypted using TLS (Transport Layer Security). Encryption and decryption require computational work—for every request, the server must:
- Decrypt the incoming request
- Process it
- Encrypt the outgoing response
This CPU work adds latency and reduces the number of requests a server can handle per second.
The Solution: Termination at the Load Balancer
Modern load balancers can terminate TLS connections, meaning they decrypt the request before sending it to the backend server. Here’s what this looks like:
┌──────────────┐
│ CLIENT │
│ │
│ HTTPS │
│ (Encrypted) │
└──────────┬───┘
│
▼
┌──────────────────────────┐
│ LOAD BALANCER │
│ ┌──────────────────┐ │
│ │ TLS Certificate │ │
│ │ Decrypts HTTPS │ │
│ └──────────────────┘ │
└────────────┬─────────────┘
│
┌────────┴────────┬────────────┐
│ │ │
▼ ▼ ▼
┌────────┐ ┌────────┐ ┌────────┐
│Server 1│ │Server 2│ │Server 3│
│ │ │ │ │ │
│ HTTP │ │ HTTP │ │ HTTP │
│(Plain) │ │(Plain) │ │(Plain) │
└────────┘ └────────┘ └────────┘
Benefits
- Offloads CPU work: Backend servers focus on business logic, not encryption
- Centralized certificate management: Update one certificate on the load balancer instead of updating it on every backend server
- Simplified security policies: Configure TLS version and cipher suites in one place
- Better performance: Faster request handling with less latency
Trade-off: End-to-End Encryption
If you terminate TLS at the load balancer, traffic between the load balancer and backend servers is unencrypted. This is usually fine because:
- This traffic stays within your private network (or VPC)
- The load balancer and backends are under your control
However, if your backends are geographically distributed or traffic crosses untrusted networks, you might keep TLS encryption end-to-end (terminate TLS at the backend instead).
Scaling Load Balancers
The Single Load Balancer Problem
A load balancer solves the “too much traffic” problem for backend servers. But the load balancer itself can become a bottleneck and a single point of failure.
What if the load balancer fails? All traffic stops. No client can reach any backend server. This is unacceptable for production systems.
Solution: Redundant Load Balancers
The standard approach is to deploy load balancers in pairs:
┌───────────────────────────────────────────────────────────┐
│ DNS Record │
│ Points to: 203.0.113.50 (Floating IP) │
└───────────────────────────────────────────────────────────┘
│
┌───────────────────┴───────────────────┐
│ │
▼ ▼
┌────────────────────┐ ┌────────────────────┐
│ PRIMARY LB │ │ SECONDARY LB │
│ 203.0.113.50 │ │ 203.0.113.51 │
│ (ACTIVE) │ │ (STANDBY) │
│ │ │ │
│ Health: PASSING │ │ Health: PASSING │
└────────────────────┘ └────────────────────┘
│ │
│ │
└──────────────┬─────────────────────┘
│
┌───────────┴───────────┐
│ │
▼ ▼
Backend Servers
(Database, Cache, etc.)
Floating IP Failover Mechanism:
If PRIMARY fails, the floating IP 203.0.113.50
automatically switches to SECONDARY.
How Failover Works
- The load balancers continuously monitor each other’s health using a heartbeat protocol
- If the primary load balancer stops responding, the secondary detects this (usually within seconds)
- The secondary takes over the floating IP address
- New client connections route to the secondary (now active)
- Existing connections may be lost, but the service recovers quickly
- When the primary recovers, it can fail back (or stay in standby, depending on configuration); a sketch of the detection loop follows below
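The detection logic behind the first few steps is simple bookkeeping. A sketch under the assumptions above (1-second heartbeats, 3 misses), where `claim_floating_ip()` is a hypothetical stand-in for the VRRP/keepalived machinery that actually moves the address:

```python
import time

HEARTBEAT_INTERVAL = 1.0   # seconds between heartbeats from the primary
MISS_THRESHOLD = 3         # missed heartbeats before taking over

def claim_floating_ip():
    # Hypothetical placeholder: reassign the VIP and announce it
    # (e.g. gratuitous ARP) so new connections reach this node.
    print("standby claimed 203.0.113.50")

def standby_loop(last_heartbeat):
    """last_heartbeat() returns the monotonic time of the last heartbeat received."""
    missed = 0
    while missed < MISS_THRESHOLD:
        time.sleep(HEARTBEAT_INTERVAL)
        if time.monotonic() - last_heartbeat() > HEARTBEAT_INTERVAL:
            missed += 1      # no heartbeat arrived in this interval
        else:
            missed = 0       # primary is alive; reset the counter
    claim_floating_ip()      # primary presumed dead: standby takes over
```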
Scaling Horizontally
If your load balancer is handling so much traffic that it’s becoming a bottleneck, you can deploy multiple load balancers:
- Regional scale: Multiple load balancers within a region, using a network-level load balancer or anycast routing to distribute traffic among them
- Global scale: Load balancers in different regions, with DNS or a global load balancer directing users to the nearest region
This is less common for the load balancer layer itself (since load balancers are quite efficient) but necessary for truly massive traffic volumes.
Global Availability
The Problem: Single Datacenter Limitations
A single datacenter has inherent risks:
- Geographic latency: Users far from the datacenter experience high latency
- Single point of failure: A datacenter outage (natural disaster, power failure) takes down your entire system
- Compliance: Some regulations require data to be stored in specific regions
Solution: Geographically Distributed Load Balancers
Deploy load balancers and backends in multiple regions worldwide:
┌─────────────────────────────────────────────────────────────┐
│ GLOBAL DNS / GSLB │
│ (Geographic Load Balancer or DNS-based routing) │
└─────────────────────────────────────────────────────────────┘
│ │ │
│ Route to closest │ Route to closest │ Route to closest
│ healthy region │ healthy region │ healthy region
│ │ │
▼ ▼ ▼
┌──────────────┐ ┌──────────────┐ ┌──────────────┐
│US East (N.V) │ │EU (London) │ │APAC (Tokyo) │
│ │ │ │ │ │
│ LB ──┐ │ │ LB ──┐ │ │ LB ──┐ │
│ │ │ │ │ │ │ │ │
│ Servers │ │ Servers │ │ Servers │
│ │ │ │ │ │ │ │ │
│ Database │ │ Database │ │ Database │
└──────────────┘ └──────────────┘ └──────────────┘
   (Primary            (Read-only           (Read-only
    or Replica)          Replica)             Replica)
How Global Load Balancing Selects a Region
When a user makes a request, the global load balancer (or DNS system) decides which regional load balancer to route them to based on:
- Geographic proximity: Route users to the nearest datacenter (lowest latency)
- Health checks: Only route to regions with healthy services
- User preferences: Route based on IP geolocation, latency measurement, or explicit user choice
- Load distribution: Distribute traffic evenly across regions if capacity allows
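A simplified sketch of that decision, assuming a static health map and pre-measured latencies (all values are made up for illustration):

```python
REGIONS = {
    "us-east": {"healthy": True,  "latency_ms": {"US": 20,  "EU": 90,  "APAC": 180}},
    "eu-west": {"healthy": True,  "latency_ms": {"US": 90,  "EU": 15,  "APAC": 220}},
    "apac":    {"healthy": False, "latency_ms": {"US": 170, "EU": 220, "APAC": 25}},
}

def pick_region(user_geo):
    # Consider only healthy regions, then choose the lowest-latency one.
    healthy = {name: r for name, r in REGIONS.items() if r["healthy"]}
    if not healthy:
        raise RuntimeError("no healthy region available")
    return min(healthy, key=lambda name: healthy[name]["latency_ms"][user_geo])

print(pick_region("APAC"))  # APAC itself is unhealthy, so traffic fails over to us-east
```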
Data Replication Considerations
Geographic distribution introduces data consistency challenges:
- Primary-replica setup: One region has the primary database (accepts writes), others have replicas (read-only). Writes are slower because they must replicate to other regions.
- Multi-master replication: All regions can accept writes, but this introduces complexity (conflict resolution, eventual consistency).
- Read-local, write-primary: Users read from their nearest region (fast) but writes go to the primary region (slower but consistent).
Common Use Cases
Use Case 1: High-Traffic E-Commerce Platform
An e-commerce site experiences 10,000 requests per second during peak shopping days. A single server can handle ~100 requests per second, so you need ~100 backend servers.
How load balancing solves this:
- Deploy a load balancer in front of 100 servers
- Use a Least Connections algorithm to distribute traffic based on actual server load
- Configure TLS termination at the load balancer to reduce backend CPU usage
- Deploy load balancers in pairs for high availability
- Use an ALB to route static content (images, CSS) to optimized servers and dynamic content to application servers
Outcome: The platform scales horizontally. As traffic grows, add more backend servers without changing client code.
Use Case 2: Real-Time Gaming Server
A multiplayer game needs to handle millions of concurrent players with extremely low latency (players expect <100ms response times).
How load balancing solves this:
- Use Network Load Balancers (NLB) for ultra-low latency (no HTTP parsing overhead)
- Use IP Hash or Source IP algorithm so players always connect to the same game server (maintains game state consistency)
- Deploy NLBs globally so players connect to their nearest regional server
- Avoid TLS termination at the load balancer; instead, use TLS directly on game servers (or use alternative protocols like QUIC)
Outcome: Players experience consistent, low-latency gameplay regardless of player count.
Use Case 3: Mobile Banking App (High Availability + Compliance)
A banking app must maintain 99.99% uptime and comply with regulations requiring data residency in specific countries.
How load balancing solves this:
- Deploy redundant load balancers in each country/region
- Use TLS termination at the load balancer with strict security policies (only modern TLS versions, strong ciphers)
- Configure health checks to immediately detect server failures and reroute traffic
- Use a primary-replica database setup: all writes go to the primary region, reads can come from local replicas
- Periodically test failover to ensure it works
Outcome: The system is highly available and resilient to individual server failures. Data stays in compliant jurisdictions.
Trade-offs and Limitations
What Load Balancers Are Good At
✅ Distributing traffic: Evenly spreads load across servers, preventing overload
✅ Detecting failures: Automatically removes unhealthy servers and reroutes traffic
✅ Horizontal scaling: Enables systems to grow by adding more servers, not bigger servers
✅ Geographic distribution: Can direct users to nearby datacenters for low latency
✅ Offloading TLS: Removes encryption overhead from backend servers
What Load Balancers Are Bad At
❌ Stateful requests without planning: If your backend servers store session state in memory, sessions are lost when a server fails or a user is routed to a different server. Solution: Use sticky sessions (route same client to same server) or external session storage (Redis, Memcached).
❌ Maintaining connection state during updates: Rolling out a new version of your backend code requires careful handling. Long-lived connections (WebSockets, streaming) may be interrupted. Solution: Use connection draining (gracefully close existing connections before removing the server).
❌ Solving database bottlenecks: A load balancer distributes requests to databases too, but the database itself becomes a bottleneck. Load balancers alone can’t fix this. Solution: Use database replication, caching, or sharding.
❌ Preventing DDoS attacks at the application level: Load balancers can distribute traffic, but they can’t distinguish between legitimate requests and attack traffic. Solution: Use DDoS mitigation services in front of the load balancer.
Common Mistakes
Mistake 1: Single Load Balancer in Production
A single load balancer becomes a single point of failure. Always deploy in pairs with automatic failover.
Mistake 2: Not Configuring Health Checks Properly
If health checks are too lenient, failed servers stay in rotation. If they’re too strict, healthy servers are removed. Tune based on your application’s actual response times.
Mistake 3: Assuming Load Balancer Solves Scalability
Load balancers distribute requests, but if your backend servers or database can’t handle the total load, adding a load balancer doesn’t help. Scale each layer independently.
Mistake 4: Forgetting About Load Balancer Limits
Modern load balancers can handle millions of connections, but they have limits. As you grow, monitor load balancer metrics (connections per second, bandwidth) and upgrade or scale horizontally if needed.
Mistake 5: Overcomplicating Routing Logic
Simple algorithms (Round Robin, Least Connections) work well for most cases. Complex adaptive algorithms introduce operational complexity. Keep it simple unless you have a specific need.
How This Fits into an Architect’s Mental Model
As a system architect, think about load balancers as a scaling abstraction layer. They solve a specific problem: distributing load across multiple servers. But they don’t solve all scaling problems.
Three Key Mental Models
1. The Scaling Hierarchy
When your system gets overloaded, you scale different layers:
Overloaded Request Handling?
↓
Use Load Balancer → Distribute across more servers
↓
Overloaded Database?
↓
Use Database Replication → Read replicas in different regions
↓
Overloaded Cache?
↓
Distribute Cache Layer → Multiple cache nodes with sharding
Load balancers address the first level, but you’ll encounter other bottlenecks as you scale.
2. Availability Through Redundancy
Every critical component needs redundancy:
Critical Path:
Client → Load Balancer → Server → Database
Redundant Path:
Client → [LB Primary + LB Secondary] → [Servers 1,2,3...] → [DB Primary + Replicas]
The load balancer is part of the critical path. Its failure must be mitigated with a backup.
3. Trade-off Between Latency, Consistency, and Complexity
When designing a global system:
- Low latency: Route users to nearby datacenters (geographic distribution)
- Strong consistency: All writes go to one primary region (slower for distant users)
- Simplicity: Single region with replicated load balancers (easier to operate)
You can’t have all three. Choose based on your requirements.
Design Decisions
When to introduce a load balancer:
- When a single server can’t handle your peak traffic
- When you need automatic failover and high availability
- When you’re deploying across multiple zones or regions
When to avoid complexity:
- Don’t use geographic load balancing if you don’t have data residency requirements or latency constraints
- Don’t over-engineer health checks; simple HTTP checks usually suffice
- Don’t implement complex sticky-session logic if external session storage (Redis) is available
Monitoring and Observability:
Once you introduce a load balancer, monitor:
- Request distribution: Are all servers receiving roughly equal traffic?
- Health check failures: Are servers being incorrectly marked unhealthy?
- Load balancer performance: Is the load balancer itself becoming a bottleneck?
- Regional failover: Test and verify geographic failover periodically
Key Takeaway
A load balancer is one of the first scalability tools you’ll use, but it’s not a silver bullet. It solves the problem of distributing traffic across servers, but each layer of your system (backend servers, database, cache) needs its own scaling strategy. Think of the load balancer as the entry point to that scaling journey, not the endpoint.
Summary
Load balancers distribute incoming traffic across multiple backend servers, enabling horizontal scaling and improving availability. They operate at different network layers (Application Layer 7 for HTTP(S) or Network Layer 4 for TCP/UDP), use various algorithms to select backend servers, and automatically detect and isolate failed servers.
Key architectural considerations: deploy load balancers in redundant pairs to avoid single points of failure, use TLS termination to offload encryption work, configure appropriate health checks, and remember that load balancers are just one layer in a scalable system. As your system grows, you’ll need to scale your database, cache, and other components independently.
The most common mistake is treating a load balancer as a complete solution to scalability—it’s not. It’s a necessary foundation for building systems that can grow horizontally, but it works best when combined with thoughtful design of backend services, database architecture, and global deployment strategies.
System Design Interview Questions
Overview
This comprehensive guide contains 8 in-depth system design interview questions on load balancers, ranging from foundational concepts to advanced architectural challenges. Each question includes a detailed ideal answer that demonstrates deep technical understanding, practical architectural thinking, and awareness of trade-offs—the hallmarks of senior-level system design interviews.
Question 1: Design a Load Balancer from Scratch
Question: Design a load balancer that can handle millions of requests per second from clients to a pool of backend servers. Explain the key components, algorithms you’d use, and trade-offs you’d consider.
Ideal Answer:
A load balancer is a network device or software component that distributes incoming client requests across multiple backend servers to optimize resource utilization, reduce latency, and provide fault tolerance.
Architecture Overview:
       Clients (Millions)
              │
              ▼
┌──────────────────────┐
│    Load Balancer     │
│ ┌──────────────────┐ │
│ │ Listener         │ │
│ │ (Port 443 HTTPS) │ │
│ └──────────────────┘ │
│          │           │
│ ┌──────────────────┐ │
│ │ TLS Termination  │ │
│ │ Decrypts HTTPS   │ │
│ └──────────────────┘ │
│          │           │
│ ┌──────────────────┐ │
│ │ Routing Logic    │ │
│ │ (Algorithm)      │ │
│ └──────────────────┘ │
│          │           │
│ ┌──────────────────┐ │
│ │ Health Checks    │ │
│ │ Monitor backends │ │
│ └──────────────────┘ │
└──────────────────────┘
           │
     ┌───┬─┼─────────┐
     ▼   ▼ ▼         ▼
 Server1 Server2 ... ServerN
        (Healthy)
Key Components:
1. Listener – Accepts incoming connections on a specific port (e.g., 443 for HTTPS), maintains connection state and tracks active connections per backend server
2. TLS Termination Module – Decrypts HTTPS traffic at the load balancer, encrypts responses, offloads CPU work from backends, and centralizes certificate management
3. Routing Algorithm – Determines which backend server handles each request:
| Algorithm | Mechanism | Best For | Trade-off |
|---|---|---|---|
| Round Robin | Sequential selection | Equal capacity servers | No load awareness |
| Least Connections | Routes to server with fewest active connections | Long-lived requests | Requires connection tracking |
| Weighted Round Robin | Accounts for server capacity differences | Heterogeneous servers | Manual weight configuration |
| IP Hash | Hashes client IP to consistent server | Session affinity | Can create imbalances |
| Least Response Time | Combines latency + connection count | Variable request durations | Higher LB overhead |
Recommendation: For millions of QPS, use Least Connections with weights—it balances real-world variability and scales well.
4. Health Check Module – Periodically probes servers to verify health:
- Failure threshold: typically 3-5 consecutive failures before removal
- Recovery threshold: 3-5 successes before re-adding
- Check interval: 5-10 seconds (balance detection speed vs. overhead)
Connection Pooling and Reuse:
Client Request 1 → LB → Backend Server 1 (Connection Pool)
Client Request 2 → LB → Backend Server 2 (Connection Pool)
Client Request 3 → LB → Backend Server 1 (Reuses connection)

Benefits: Reduces TCP handshake overhead, reduces backend CPU usage, improves latency for high-QPS scenarios.
Trade-offs and Design Decisions:
| Decision | Why | Trade-off |
|---|---|---|
| Terminate TLS at LB | Offload CPU, centralize certificates | Unencrypted traffic LB↔Backend (acceptable in private network) |
| Use Least Connections | Adapts to real load, handles variable times | Requires connection state tracking |
| Health checks every 5s | Balance detection latency vs. overhead | Takes ~15s to detect failure (3 failures × 5s) |
| Connection pooling | Reduces TCP handshake overhead | Complexity in connection lifecycle management |
Handling Load Balancer Bottlenecks:
For millions of QPS, a single load balancer will eventually saturate. Solutions include:
- Redundant LBs: Deploy 2-3 load balancers in active-active or active-passive mode
- Layer load balancers: Use L4 LB (fast) in front of multiple L7 LBs (intelligent)
- Distributed LB: Use hash-based client assignment to route clients to different LBs via DNS or Anycast
Question 2: Application Load Balancer vs Network Load Balancer
Question: Your company is building three services: (1) a web e-commerce platform, (2) a low-latency gaming backend, and (3) a financial transaction system. Which load balancer type would you recommend for each, and why?
Ideal Answer:
Understanding the Layers:
- Network Load Balancer (NLB, L4): Operates at the transport layer (TCP/UDP). Routes based on IP address and port only. Extremely fast, can handle millions of connections.
- Application Load Balancer (ALB, L7): Operates at the application layer (HTTP/HTTPS). Can read HTTP headers, paths, hostnames. Slower but more intelligent.
1. E-Commerce Platform → Application Load Balancer
Why:
- Need intelligent routing: /api/* → API servers, /images/* → image servers
- Can route based on HTTP headers and hostnames
- TLS termination simplifies backend architecture
- Latency less critical (users tolerate 50-100ms for page loads)
Configuration Example:
Routing Rules:
  Host: www.ecommerce.com /api/*    → API Server Pool
  Host: www.ecommerce.com /images/* → Image Server Pool
  Header: X-Admin = true            → Admin Server Pool
Trade-off: Slightly higher latency (ALB must parse HTTP), but acceptable for e-commerce.
2. Low-Latency Gaming Backend → Network Load Balancer
Why:
- Games require sub-100ms latency (ideally <50ms). Every millisecond matters.
- UDP-based protocols (common in games) require L4 load balancing
- NLB processes traffic with minimal overhead → lower latency
- Can handle millions of concurrent connections
Implementation Detail:
NLB Configuration:
  Protocol: UDP
  Algorithm: IP Hash (ensure same player always goes to same server)

Why IP Hash for games?
- Game state stored on server
- If a player connects to a different server, character position is lost
- Sticky routing ensures consistency
Trade-off: No application-level intelligence; if you need complex routing, use a secondary routing layer.
3. Financial Transaction System → Network Load Balancer with Sticky Sessions
Why:
- Financial transactions often require strict ordering and consistency
- NLB provides low latency for time-sensitive operations
- Sticky sessions: Route all transactions from a client to the same server to maintain ordering invariants
- Extremely high throughput capable
Configuration:
NLB Configuration:
  Protocol: TCP (TLS pass-through, not termination)
  Sticky Sessions: IP-based routing

Why not terminate TLS at NLB?
- Unencrypted traffic LB↔Backend is risky for financial data
- Instead: pass TLS through; backends handle encryption
Comparison Table:
| Criteria | ALB | NLB |
|---|---|---|
| OSI Layer | L7 (Application) | L4 (Transport) |
| Protocols | HTTP, HTTPS | TCP, UDP |
| Latency | 10-100ms overhead | <10ms overhead |
| Throughput | ~100,000 RPS per instance | >1M RPS per instance |
| Intelligent Routing | Yes (headers, paths, hostnames) | No (only IP/port) |
| Best For | Web apps, microservices, APIs | Gaming, real-time data, high-throughput |
Question 3: Handle Session State with Load Balancers
Question: Your application stores user session data in memory on each backend server. When you add a load balancer and scale to 3 servers, requests from the same user can go to different servers. How would you solve session loss?
Ideal Answer:
The Problem:
Request 1: User logs in → Server 1
  Server 1 stores: {user_id: 123, cart: [item1], logged_in: true}

Request 2: User adds item → Server 2 (different server!)
  Server 2 has no session info
  Result: "Please log in" error
This is the classic session state problem in load-balanced systems.
Solution 1: Sticky Sessions (Session Affinity)
The load balancer remembers which server each client connects to and always routes them there.
Implementation A: IP-Based Sticky Sessions
Hash(client_IP) → Server 1
All requests from 192.168.1.100 → Server 1

Pros: Simple, no additional data structures
Cons: Mobile users change IPs (sessions lost), unbalanced load
Implementation B: Cookie-Based Sticky Sessions
Response from Server 1:
  Set-Cookie: SERVERID=server-1; Path=/

Subsequent requests:
  Cookie: SERVERID=server-1
  LB reads cookie, routes to that server

Pros: Works even if client IP changes
Cons: Still loses session if server fails
Overall: simpler to adopt, but causes uneven load distribution and failover complexity.
Solution 2: External Session Storage (Recommended for Scale)
Move session data out of backend servers into a separate, shared store:
┌──────────────────────────────────────┐
│            Load Balancer             │
└──────────────┬───────────────────────┘
               │
       ┌───────┴────────┐
       ▼                ▼
┌────────────────┐ ┌────────────────┐
│    Server 1    │ │    Server 2    │
│  (Stateless)   │ │  (Stateless)   │
└────────────────┘ └────────────────┘
        │                  │
        └────────┬─────────┘
                 ▼
      ┌──────────────────┐
      │   Redis Cache    │
      │  Session Store   │
      │  {user_123:      │
      │   {cart: [..],   │
      │    logged_in:T}} │
      └──────────────────┘
Implementation Example:
# Assumes a Flask `app`, `request`, a `redis` client, and `json` imported elsewhere.
@app.route('/add-to-cart', methods=['POST'])
def add_to_cart():
    item = request.json['item']
    user_id = request.headers.get('User-ID')

    # Read session from Redis (any server can access it)
    session = json.loads(redis.get(f"session:{user_id}") or "{}")
    session.setdefault('cart', []).append(item)

    # Write back to Redis (shared across servers, 1-hour expiry)
    redis.set(f"session:{user_id}", json.dumps(session), ex=3600)
    return {"status": "added", "cart": session['cart']}
Pros: Perfect scalability, fault tolerance with replication, handles migrations easily
Cons: External dependency, slightly higher latency, cache becomes potential single point of failure.
Solution 3: Stateless Applications (Best Practice)
Store session data in the client (signed JWT token) or use event sourcing:
Request 1: User logs in → Server 1
  Server 1 creates JWT: eyJhbGc...payload...signature
  Returns: {token: "eyJhbGc...", user_id: 123}

Request 2: User adds item → Server 2
  Client sends JWT: Authorization: Bearer eyJhbGc...
  Server 2 verifies JWT signature (no state lookup)
  Reads user info from JWT
  Result: Works seamlessly!
Pros: Perfect scalability, no external dependencies, great for microservices
Cons: Tokens add to request size, can't be revoked instantly, and client-held data can only be trusted after signature verification.
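A minimal stateless-session sketch using the PyJWT library; the secret, claims, and expiry are illustrative, and in practice the key comes from secure configuration shared by all backends:

```python
import time
import jwt  # PyJWT

SECRET = "change-me"  # illustrative only; load from secrets management in practice

def issue_token(user_id):
    # Any server can issue a token; the session state lives in the signed payload.
    return jwt.encode({"sub": str(user_id), "exp": int(time.time()) + 3600},
                      SECRET, algorithm="HS256")

def verify_token(token):
    # Any server can verify it with the shared secret: no session lookup needed.
    return jwt.decode(token, SECRET, algorithms=["HS256"])

token = issue_token(123)
print(verify_token(token)["sub"])  # "123", no matter which server handles the request
```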
Comparison Table:
| Approach | Complexity | Scalability | Fault Tolerance | Best For |
|---|---|---|---|---|
| Sticky Sessions | Low | Medium (uneven load) | Poor (server failure) | Small systems |
| External Cache | Medium | Excellent | Good (with replication) | Web apps, shopping carts |
| Stateless (JWT) | Low | Excellent | Excellent | Mobile apps, microservices |
Recommendation: For small apps, use sticky sessions; for growing apps, use Redis; for distributed systems, use stateless JWT.
Question 4: Load Balancer as Single Point of Failure
Question: Your system has a load balancer in front of 10 backend servers. What happens if the load balancer fails? How would you design for high availability?
Ideal Answer:
The Problem:
A single load balancer is a single point of failure. If it crashes:
Client → [DEAD LB] → No backend reachable
Result: Entire service down, even though backends are healthy
This is unacceptable for production.
Solution: Redundant Load Balancers with Failover
┌────────────────────────────────────────────────────────┐
│              Floating IP: 203.0.113.50                 │
│     (Shared by both load balancers via failover)       │
└────────────────────────────────────────────────────────┘
                          │
              ┌───────────┴───────────┐
              │                       │
              ▼                       ▼
   ┌──────────────────┐    ┌──────────────────┐
   │  LB-1 (Primary)  │    │  LB-2 (Standby)  │
   │  203.0.113.50    │    │  203.0.113.51    │
   │  Status: ACTIVE  │    │  Status: PASSIVE │
   └──────────────────┘    └──────────────────┘
              │   Heartbeat (every 1 second)
              └──────────────────────────────→
How Failover Works:
Timeline:
  t=0s:    LB-1 crashes
  t=1s:    LB-2 expects a heartbeat, doesn't receive one
  t=2s:    LB-2 expects a heartbeat, still nothing
  t=3s:    3 heartbeats missed in a row
           → LB-2 claims the floating IP 203.0.113.50
  t=3-5s:  Existing connections to LB-1 get reset
           Clients retry, now go to LB-2
  t=5-10s: DNS propagation (if using DNS failover)
           New clients query DNS → get LB-2's IP

Result: Service back online within 3-10 seconds
Implementation Approaches:
Approach 1: Virtual IP with Keepalived (On-Premise)
Uses VRRP protocol for automatic failover:
LB-1: 192.168.1.10
LB-2: 192.168.1.11
Floating VIP: 192.168.1.50 (shared)

Keepalived continuously exchanges heartbeats
On failure, it automatically updates the ARP table
Pros: Open-source, widely supported
Cons: Self-managed, requires operational expertise.
Approach 2: Cloud-Native (AWS, GCP, Azure)
Modern cloud providers handle this automatically:
AWS ALB:
- Deployed across 2+ availability zones
- Each AZ has a separate LB instance
- If one AZ goes down, traffic reroutes to another
- No manual failover needed
Pros: Transparent, highly available, managed by provider
Cons: Vendor lock-in, less control.
Approach 3: Global Load Balancer with DNS
For multi-region deployment:
┌──────────────────────────────────┐
│   Global Load Balancer / DNS     │
│   (Route 53, Cloudflare, etc.)   │
└──────────────────────────────────┘
               │
    ┌─────┬────┼─────────┐
    ▼     ▼    ▼         ▼
   LB-   LB-  LB-       LB-
   US    EU   APAC      ...
   East  London Tokyo

Each regional LB is itself redundant
Key Metrics:
| Parameter | Value | Why |
|---|---|---|
| Heartbeat Interval | 1 second | Balance detection speed vs. network overhead |
| Failure Threshold | 3 missed heartbeats | Tolerate brief glitches, detect real failures |
| Total Failover Time | 3-10 seconds | Depends on VIP vs DNS |
| Primary Priority | 200 | Primary always becomes active first |
| Standby Priority | 150 | Becomes active only if the primary dies |
Handling Stateful Connections During Failover:
Challenge: If LB-1 fails mid-request, what happens?

Options:
1. Accept the loss (simplest)
   - Client retries
   - OK for idempotent operations (GET, POST with idempotency key)
2. Connection Draining (graceful shutdown)
   - Mark LB as "draining"
   - Stop accepting new connections
   - Wait for existing connections to finish
   - Then remove from service
   - Pro: Zero connection loss; Con: Takes time
3. Connection Replication (complex)
   - Primary and standby replicate connection state
   - Transparent failover
   - Con: Very complex, high overhead
For most systems, Accept the loss + retry is acceptable; most client libraries already implement exponential backoff.
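In client terms, "accept the loss + retry" usually looks like retry with exponential backoff around idempotent calls. A hedged sketch (URL, timeouts, and attempt counts are placeholders):

```python
import random
import time
import requests

def get_with_retries(url, attempts=4):
    for attempt in range(attempts):
        try:
            return requests.get(url, timeout=3)
        except requests.RequestException:
            if attempt == attempts - 1:
                raise
            # Exponential backoff with a little jitter: ~0.5s, 1s, 2s ...
            time.sleep(0.5 * (2 ** attempt) + random.random() * 0.1)

# During failover the first call may fail; a later retry reaches the standby LB.
# response = get_with_retries("https://api.example.com/orders/42")
```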
Question 5: Geographically Distributed Load Balancing
Question: Your company operates in 3 regions: US East, EU West, and APAC. You want to serve users with low latency and provide regional failover. Design a load balancing strategy.
Ideal Answer:
Architecture Overview:
┌──────────────────────────────────────────────────────────┐
│              Global Load Balancer (GLB)                  │
│  - Route 53 (AWS), Cloudflare, or custom                 │
│  - Periodically health checks each region                │
│  - Routes users to nearest healthy region                │
└──────────────────────────────────────────────────────────┘
        │                 │                 │
        ▼                 ▼                 ▼
 ┌─────────────┐   ┌─────────────┐   ┌─────────────┐
 │  US-EAST    │   │  EU-WEST    │   │   APAC      │
 │  Region     │   │  Region     │   │  Region     │
 │             │   │             │   │             │
 │ ┌─────────┐ │   │ ┌─────────┐ │   │ ┌─────────┐ │
 │ │LB Pair  │ │   │ │LB Pair  │ │   │ │LB Pair  │ │
 │ └─────────┘ │   │ └─────────┘ │   │ └─────────┘ │
 │      │      │   │      │      │   │      │      │
 │  [Servers]  │   │  [Servers]  │   │  [Servers]  │
 │  [DB Pri]   │   │  [DB Rep]   │   │  [DB Rep]   │
 └─────────────┘   └─────────────┘   └─────────────┘
Three Layers of Load Balancing:
Layer 1: Global Load Balancer (Geographic Routing)
Route users to the nearest region based on DNS geolocation:
User in San Francisco
  → GLB checks: US East health = HEALTHY
  → GLB returns: IP of LB-1 (US East)

User in London
  → GLB checks: EU West health = HEALTHY
  → GLB returns: IP of LB-2 (EU West)
Implementation (AWS Route 53):
1. Create health checks for each region
   - Health check on US-EAST: GET /health → 200 OK
   - Health check on EU-WEST: GET /health → 200 OK
2. Create geolocation-based routing
   - Location: North America → US-EAST LB IP
   - Location: Europe → EU-WEST LB IP
   - Location: Asia → APAC LB IP
3. Health check interval: 30 seconds
   - If a region fails 3 checks: remove it from routing
Alternative: Latency-based routing measures actual latency to each region instead of geography.
Layer 2: Regional Load Balancer (Within-Region Failover)
Each region has a redundant pair (identical to Question 4).
Layer 3: Backend Load Balancing (Server Selection)
Within each region, use algorithms (Round Robin, Least Connections, etc.).
Data Consistency Across Regions:
Challenge: If database is in US-EAST, how do EU and APAC regions access it?
Option 1: Primary-Replica Setup (Recommended)
Write Path:
  User in US → LB-US → Server → DB-Primary (US)

Replication:
  DB-Primary (US) → Replicates to → DB-Replica (EU)
  DB-Primary (US) → Replicates to → DB-Replica (APAC)

Read Path:
  User in US → Reads from DB-Primary (low latency, latest)
  User in EU → Reads from DB-Replica (higher latency, slightly stale)
Pros: Strong consistency, scalable reads
Cons: Replication lag; if the US primary goes down, writes are blocked until a replica is promoted.
Handling Region Failure:
Scenario: US-EAST region becomes unavailable

Timeline:
  t=0s:  US-EAST datacenter loses power
  t=10s: GLB health check times out (failure #1)
  t=40s: GLB fails health check 3x, removes US-EAST

US users:
  - Try to reach US LB: connection timeout
  - Browser/client retries
  - New DNS query gets EU-WEST or APAC IP
  - Users rerouted (100-200ms higher latency)
  - Service continues

Database:
  - US-EAST DB primary is down
  - Replicas in EU/APAC have the most recent replicated writes
  - Manually promote the EU replica to primary
Optimizations:
1. DNS TTL – Set short TTL (30-60 seconds) for fast failover
2. Database Replication Lag – Monitor and minimize; route writes/reads to same region
3. Anycast Routing – Same IP from multiple regions; packets route to closest (advanced)
Question 6: Load Balancer Performance Optimization
Question: Your load balancer handles 1 million requests per second. CPU is at 80% and approaching saturation. What strategies would you use to optimize performance without upgrading hardware?
Ideal Answer:
Diagnosis: Where is CPU Usage?
Typical LB CPU breakdown:
- TLS handshake/decryption: 40-50%
- Connection state tracking: 20-30%
- Routing algorithm: 10-15%
- Health checks: 5-10%
- Logging/monitoring: 10-15%
Optimization Strategies (In Priority Order):
1. Optimize TLS Handling (20-30% reduction)
Strategy A: Enable TLS 1.3 + Session Resumption
TLS Session Resumption (Session Tickets):

Connection 1:
  CLIENT: ClientHello
  SERVER: ServerHello, [Session Ticket]
  (Full handshake, CPU intensive)

Connection 2 (minutes later):
  CLIENT: ClientHello + Session Ticket
  SERVER: Resumes session (no full handshake)
Benefits: TLS 1.3 needs only one round trip for the handshake (vs. two for TLS 1.2) and supports only modern, fast cipher suites.
2. Enable HTTP/2 Multiplexing (15-20% reduction)
HTTP/1.1:
  Client 1 → Dedicated TCP connection(s) → Backend
  Client 2 → Dedicated TCP connection(s) → Backend
  (One or more TCP connections per client, high connection count)

HTTP/2 (Multiplexing):
  Each client → Single TCP connection carrying many concurrent streams
  LB → Backends: requests multiplexed over a small pool of reused connections
  (Fewer total connections)

Result:
- Fewer TCP connections to track
- Fewer TLS handshakes
- 30-50% CPU reduction
3. Reduce Health Check Frequency (10% reduction)
Current (inefficient):
  Health check interval: 5 seconds
  Failure threshold: 3 failures

Optimized:
  Health check interval: 10 seconds
  Failure threshold: 2 failures
  Use TCP-only checks instead of HTTP (faster)

Example:
  TCP check: connect to port (≈50 microseconds)
  HTTP check: GET /health, 200 OK (5-10 milliseconds)

Trade-off: ~90% reduction in health-check CPU, but less visibility into application health
4. Switch Routing Algorithm (10-15% reduction)
Algorithm CPU cost:
- Round Robin: O(1) (just increment a counter)
- Least Connections: O(n) (check every server's connection count)
- IP Hash: O(1) (just hash the IP)

Current: Least Connections (O(n))
New: Weighted Round Robin (O(1))
With 1000 servers: 1000x faster per-request lookup

Hybrid: "Power of d Choices"
- Randomly sample d servers (e.g., d=2)
- Pick the one with the fewest connections
- O(d) lookup, d << n
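A tiny sketch of the "power of d choices" hybrid for d=2 (server names and connection counts are made up):

```python
import random

def power_of_two_choices(servers, connections):
    # Sample two servers at random and keep the one with fewer active connections.
    a, b = random.sample(servers, 2)
    return a if connections[a] <= connections[b] else b

servers = [f"server-{i}" for i in range(1000)]
connections = {s: random.randint(0, 50) for s in servers}
print(power_of_two_choices(servers, connections))  # near-least-loaded pick in O(1)
```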
5. Offload Logging/Monitoring (10-15% reduction)
Current: Log every request (timestamp, source IP, backend, response time)

Options:
A. Sampling: log only 1% of requests
   - 10x CPU reduction
   - Still get statistical insights
B. Offload to backend: backends handle detailed logging
   - LB logs minimal (only errors)
   - 15% CPU reduction
C. Async logging: buffer locally, send asynchronously
   - 5% CPU reduction
6. Vertical Scaling (2-3x improvement)
If optimizations still hit 80% CPU:
Upgrade to newer CPUs with:
- AES-NI: hardware-accelerated encryption (40% faster TLS)
- AVX-512: SIMD operations (faster hashing)
- More cores: distribute load across threads

Result: 2-3x CPU capacity
Cost: $1,000-10,000 per instance
Optimization Priority Checklist:
- ✅ Enable TLS 1.3 + session resumption (20-30%, easy)
- ✅ Enable HTTP/2 multiplexing (15-20%, easy)
- ✅ Reduce health check frequency (10%, easy)
- ✅ Switch to weighted round robin (10-15%, easy)
- ✅ Offload logging/sampling (10-15%, moderate)
- ✅ Upgrade CPU if needed (2-3x, expensive)
Typical Result: 50-80% CPU reduction with minimal effort.
Question 7: Load Balancer and Database Connections
Question: Your app uses a database connection pool (max 100 connections). You have 10 backend servers, each with its own pool. How many connections should each pool have? What issues might arise?
Ideal Answer:
Understanding Connection Pools:
Without connection pool:
  Open connection → Handshake → Execute query → Close connection
  (Time: 100-500ms including handshake)

With connection pool:
  Get connection from pool → Execute query → Return to pool
  (Time: 1-10ms, connection already open)
The Problem: Oversubscription
Naive approach:
  10 backend servers × 100 connections per server = 1,000 connections
  Database max connections: 100 (default)

Result: PROBLEM! 900 connection attempts fail
Error: "too many connections"
Correct Design:
Step 1: Determine Database Connection Limit
Check database configuration:
  PostgreSQL: SHOW max_connections;
  MySQL: SHOW VARIABLES LIKE 'max_connections';

Typical default: 100
Typical enterprise: 500-5000
Step 2: Account for Connection Overhead
Database reserved connections:
- Replication connections: 2-5
- Admin connection: 1
- Monitoring agents: 1-3

Available for application: Total - Reserved

Example (max_connections=100):
  Total: 100
  Reserved: 10
  Available for app: 90
Step 3: Divide Among Backend Servers
Formula:
  Pool size per server = (Available connections) / (Number of servers)

Example:
  90 available / 10 servers = 9 connections per server

Configuration per backend:
  database:
    pool:
      maxConnections: 9
      minConnections: 3
      maxIdleTime: 300   # Close idle connections after 5 minutes
Accounting for Peak Load:
If not all servers are healthy (e.g., 80% expected uptime):

Safer formula:
  Pool size = Available connections / (Expected minimum healthy servers)

Example:
  90 connections / 8 servers (80% of 10) = 11 connections per server

Caveat: if all 10 servers come back healthy, 10 × 11 = 110 exceeds the 90 available connections, so either leave headroom in max_connections or use a pooler that enforces a global cap.
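The same arithmetic as a small helper, so the numbers can be recomputed when limits or server counts change (values mirror the example above):

```python
def pool_size_per_server(max_connections, reserved, servers, min_healthy_ratio=1.0):
    available = max_connections - reserved
    expected_healthy = max(1, int(servers * min_healthy_ratio))
    return available // expected_healthy

print(pool_size_per_server(100, 10, 10))                          # 9  (all 10 healthy)
print(pool_size_per_server(100, 10, 10, min_healthy_ratio=0.8))   # 11 (plan for 8 healthy)
```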
Common Issues:
Issue 1: Connection Leaks
Problem: Application opens a connection but forgets to close it

Timeline:
  t=0:    App opens connection 1
  t=5s:   Connection never closed (bug)
  t=10s:  App opens connection 2
  ...
  t=100s: All 9 pool connections exhausted
  t=101s: App request fails: "No available connections"

Solution:
- Use connection timeouts
- Monitor pool utilization
- Code review
Issue 2: Long-Running Queries
Problem: A query takes 30 seconds, so its connection is held for 30 seconds

Scenario (pool size 9):
  Request 1: Uses connection 1 (t=0-30s)
  Request 2: Uses connection 2 (t=1-31s)
  ...
  Request 9: Uses connection 9 (t=8-38s)
  Request 10 (t=9s): Wants a connection, all 9 in use
                     → Request queues and waits

Solution:
- Optimize slow queries
- Set a query timeout (e.g., 5 seconds)
- Increase pool size (but this increases DB load)
Monitoring and Alerting:
Key metrics:
1. Database active connections
   Alert if: > 80% of max_connections
2. Per-server pool utilization
   Alert if: > 90% for 1 minute
3. Connection wait time
   Alert if: > 1 second
4. Connection lifecycle
   Alert if: opened but never closed (leak detection)
Database-Side Tuning (If Needed):
Option A: Increase database limits (if possible)
  max_connections = 500
  shared_buffers = 256MB
  Cost: More RAM needed
  Benefit: Each server can have a larger pool

Option B: Use a database connection pooler (PgBouncer, ProxySQL)
  Backends → PgBouncer → Database
  Multiplexes: 1000 client connections → 50 actual DB connections
  Benefit: Support more concurrent users without overloading the DB
  Trade-off: Additional component, slight latency
Summary: Connection Pool Sizing
| Component | Count | Reasoning |
|---|---|---|
| DB Max Connections | 100 | Default, check your DB |
| Reserved | 10 | Replication, admin, monitoring |
| Available | 90 | 100 – 10 |
| Backend Servers | 10 | Your LB routes to 10 |
| Pool per Server | 9 | 90 / 10 |
| Total at Peak | 90 | 9 × 10 (all healthy) |
Question 8: Load Balancer and Cache Coherence
Question: Your application uses distributed caching (Redis cluster). With a load balancer, the same user can hit different backend servers. How do you ensure cache coherence? What issues arise?
Ideal Answer:
The Problem: Race Conditions
Scenario: Concurrent requests

User Request 1 (t=0): Update profile name → Server 1
  Server 1: GET from Redis {name: "John", email: "john@..."}

User Request 2 (t=1): Update profile email → Server 2
  Server 2: GET from Redis {name: "John", email: "john@..."}

Server 1 (t=2): Modifies to {name: "Jane", email: "john@..."}
  Server 1: SET to Redis (overwrites previous state)

Server 2 (t=3): Modifies to {name: "John", email: "jane@..."}
  Server 2: SET to Redis (overwrites Server 1's changes!)

Result: The name update to "Jane" is lost! (Lost update problem)
This is the fundamental cache coherence challenge.
Solution 1: Sticky Sessions
If same user always goes to same server:
- All user’s requests go to Server 1
- No race condition (only one server touches user data)
Pros: Simpler code
Cons: Uneven load, failover complexity.
Solution 2: Versioning with CAS (Compare-and-Swap)
Use atomic operations provided by Redis:
Pseudocode:

Server 1:
  1. GET user:123:version            (current = 5)
  2. GET user:123:profile:v5
  3. Update the value locally
  4. Write user:123:profile:v6 and bump the version to 6
     (atomically, only if the version is still 5)

Server 2 (concurrent):
  1. GET user:123:version            (gets 5)
  2. GET user:123:profile:v5
  3. Update the value locally
  4. Tries the same conditional write, but the version is now 6, not 5
  5. Detects the conflict: version changed
  6. Retries: re-reads v6, applies its update, writes v7

Result: No lost updates, but retry logic is needed
Redis Implementation (WATCH/MULTI/EXEC):
WATCH user:123:version
GET user:123:profile          # read and compute the update outside MULTI
MULTI
  SET user:123:profile {updated}
  INCR user:123:version
EXEC

If another client changed user:123:version after WATCH:
  WATCH detects the change
  EXEC fails (returns nil)
  Client retries from WATCH
Pros: Handles concurrent updates without locking
Cons: Retry logic needed, possible retry storms.
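With the redis-py client, that retry loop looks roughly like this (key names and the update shape are illustrative, not a fixed schema):

```python
import json
import redis

r = redis.Redis()

def update_profile(user_id, changes, max_retries=5):
    key = f"user:{user_id}:profile"
    for _ in range(max_retries):
        with r.pipeline() as pipe:
            try:
                pipe.watch(key)                       # abort the EXEC if key changes
                profile = json.loads(pipe.get(key) or "{}")
                profile.update(changes)               # read-modify-write done locally
                pipe.multi()
                pipe.set(key, json.dumps(profile))
                pipe.execute()                        # raises WatchError on conflict
                return profile
            except redis.WatchError:
                continue                              # another server wrote first: retry
    raise RuntimeError("too many concurrent updates")
```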
Solution 3: Distributed Locks
Use locks to ensure only one server updates at a time:
User request: Update profile

Server 1:
  1. ACQUIRE lock on user:123:profile
  2. GET user:123:profile
  3. Update
  4. SET user:123:profile {updated}
  5. RELEASE lock

Server 2 (concurrent):
  1. ACQUIRE lock -- BLOCKS
  2. Waits for Server 1
  3. Once released, acquires the lock
  4. Proceeds with its update

Result: Serialized updates, no race condition
Redis Implementation (single-instance lock; Redlock extends this pattern across multiple Redis nodes):
ACQUIRE lock:
  SET user:123:lock "server-2-uuid" EX 10 NX
  (Set if not exists, expire in 10 seconds)

If someone else has the lock:
  SET fails, server waits/retries

RELEASE lock:
  IF lock_value == "server-2-uuid":
    DEL user:123:lock
  (Compare before delete)
Pros: Strong consistency
Cons: Blocking, potential deadlocks, performance hit.
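A redis-py sketch of that single-instance lock (the key, TTL, and compare-and-delete release are illustrative; production code would likely use a maintained lock helper):

```python
import uuid
import redis

r = redis.Redis()

# Lua compare-and-delete: only the holder of the matching token may release.
RELEASE_SCRIPT = """
if redis.call('get', KEYS[1]) == ARGV[1] then
    return redis.call('del', KEYS[1])
end
return 0
"""

def acquire_lock(key, ttl=10):
    token = str(uuid.uuid4())
    # SET ... NX EX: acquire only if the lock key does not already exist.
    if r.set(key, token, nx=True, ex=ttl):
        return token
    return None      # someone else holds the lock

def release_lock(key, token):
    r.eval(RELEASE_SCRIPT, 1, key, token)

token = acquire_lock("user:123:lock")
if token:
    try:
        pass  # ... read, modify, and write user:123:profile safely ...
    finally:
        release_lock("user:123:lock", token)
```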
Solution 4: Event Sourcing
Instead of storing current state, store immutable events:
User Request 1: Change name to "Jane"
  Event: {"type": "NameChanged", "name": "Jane", ts: t1}
  APPEND to user:123:events

User Request 2: Change email
  Event: {"type": "EmailChanged", "email": "jane@...", ts: t2}
  APPEND to user:123:events

Rebuilding state (any server):
  Read all events: [NameChanged, EmailChanged]
  Replay: start with {}, apply NameChanged, apply EmailChanged
  Result: {name: "Jane", email: "jane@..."}
Pros: No race conditions, full audit trail
Cons: Complex, state reconstruction overhead.
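A minimal replay sketch showing why appends avoid the lost-update problem (the event list stands in for an append-only log such as a Redis stream or Kafka topic):

```python
events = []  # stand-in for an append-only log

def append(event):
    events.append(event)   # concurrent appends never overwrite each other

def rebuild_state():
    state = {}
    for e in events:       # replay in order to reconstruct the current state
        if e["type"] == "NameChanged":
            state["name"] = e["name"]
        elif e["type"] == "EmailChanged":
            state["email"] = e["email"]
    return state

append({"type": "NameChanged", "name": "Jane"})
append({"type": "EmailChanged", "email": "jane@example.com"})
print(rebuild_state())  # {'name': 'Jane', 'email': 'jane@example.com'}
```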
Comparison Table:
| Solution | Complexity | Performance | Consistency |
|---|---|---|---|
| Sticky Sessions | Low | High | Strong |
| Versioning + CAS | Medium | Good | Eventual (with retries) |
| Distributed Locks | Medium | Low (blocking) | Strong |
| Event Sourcing | High | Medium | Eventual (by replay) |
| Stateless Design | High | High | N/A (no state) |
Best Practice Recommendation:
- Use sticky sessions (simplest) if app tolerates server failures
- Use stateless design (best practice) if possible—store state in DB, cache read-only
- Use CAS/versioning if cache updates necessary, conflict rate low
- Use locks only if serialization required (financial transactions)
Tips and Key Takeaways
When discussing load balancers in system design interviews:
1. Always Design for Redundancy
- No single point of failure
- Deploy load balancers in pairs with automatic failover
- Monitor health continuously
2. Choose the Right Layer
- ALB for web applications, intelligent routing
- NLB for low-latency, high-throughput scenarios
3. Handle State Properly
- Sticky sessions for simple cases
- External cache (Redis) for scale
- Stateless design (JWT) for distributed systems
4. Account for Regional Growth
- Global load balancer with health checks
- Regional failover documented and tested
- Database replication strategy defined
5. Measure Before Optimizing
- Identify actual bottlenecks (TLS, connections, etc.)
- Optimize in priority order
6. Understand Trade-offs
- Latency vs. intelligent routing
- Scalability vs. consistency
- Simplicity vs. resilience
7. Don’t Over-Engineer
- Simple algorithms (Round Robin, Least Connections) work for most cases
- Complex adaptive algorithms introduce operational complexity
- Keep it simple unless you have specific needs
The best load balancer design is one that’s appropriate for your specific use case, not the most complex or feature-rich design.