Capacity planning

1. What is Capacity Planning?

Capacity planning is the discipline of answering one question in a systematic way:

“Given how the business expects customers to use the site, what amount of compute, database, cache, network, and storage do we need so the system remains fast and reliable, even during peaks and failures?”

Conceptually, it is translation work:

  • Start from business expectations: users, orders, campaigns, markets.
  • Translate them into system load: traffic, data, background work.
  • Decide how much capacity per component is needed: services, DBs, caches, queues, storage.
  • Build in headroom and elasticity so that surprises do not take you down.

For e‑commerce, capacity planning is not just a cost exercise; it is revenue protection:

  • Product discovery must stay responsive during sales and campaigns.
  • Checkout must remain reliable even if some dependencies are slow.
  • The backend must absorb short spikes without cascading failures.

In interviews, the key idea to communicate is:

You do not guess; you start from business numbers, model user behavior, and reason your way down to per‑component needs.


2. From Business Metrics to Technical Load

2.1 Business Metrics as the Starting Point

Capacity planning begins from a business view of the world rather than from CPU and memory directly. Typical business metrics:

  • DAU / MAU (daily / monthly active users)
  • Traffic patterns: time‑of‑day, weekday vs weekend, seasonal events
  • Conversion rate: what fraction of active users place orders
  • Orders per day / per hour
  • Planned events: Black Friday, flash sales, influencer campaigns

What matters conceptually is:

  • How many real people touch the system in a time window?
  • How do they behave (browsing, searching, comparing, checking out)?
  • Are traffic patterns smooth or spiky?

2.2 Technical Metrics as a Reflection of Business Activity

From those business metrics, you conceptually derive technical ones:

  • Requests per second (RPS) to your APIs
  • Queries per second (QPS) to databases and search engines
  • Cache operations per second
  • Messages per second flowing through queues and streams
  • Data transfer per second (network bandwidth)

Important conceptual moves:

  • Think in flows, not individual calls:
    “A typical session hits these services and these data stores.”
  • Think in peaks, not averages:
    “What happens in the busiest 5–15 minutes of the day?”
  • Think in ratios, not precise numbers:
    “Most traffic is read‑heavy; writes are much rarer and more precious.”

In an interview, it is enough to say:

“I start from DAU/MAU, expected peak multipliers, and typical user journeys (browse, search, PDP, cart, checkout). From that, I derive approximate RPS per journey and per service.”


3. Load by Journey

Rather than obsessing over exact numbers, you categorize the shapes of load:

  • Browse & home pages: high volume, read‑heavy, fairly cacheable.
  • Search: CPU‑intensive (ranking, scoring), often delegated to a search engine.
  • Product details (PDP): still read‑heavy, more personalized, partly cacheable.
  • Cart: mix of reads and writes, often user‑specific and short‑lived data.
  • Checkout: write‑heavy, transactional, involves multiple dependencies.

Conceptually:

  • The upper part of the funnel (browse/search/PDP) drives most of the read load.
  • The lower part (cart/checkout) drives most of the write and consistency requirements.
  • You size your system so that heavy reads do not crush your critical writes.

In a microservices architecture, this becomes a map:

  • Catalog / product service handles most browse and PDP reads.
  • Search service handles search traffic.
  • Cart service manages temporary state and updates.
  • Order and payment services handle the transactional core.

Capacity planning is, in large part, understanding how each of these flows exercises each component.


4. Planning per Component

4.1 API Gateway / Load Balancer

Mentally, treat the gateway and load balancers as your front door:

  • They must be able to terminate all incoming connections and TLS handshakes.
  • They must be configured to spread traffic across instances/pods reliably.
  • Their capacity is usually high, but you must know their limits and quotas.

Conceptually:

  • You estimate your peak request rate and payload sizes.
  • You check that your API Gateway / ALB can sustain that with generous headroom.
  • You remember that they are rarely the first bottleneck, but misconfiguration can make them one (e.g., low connection limits).

4.2 Application Services (Java, Node)

For Spring Boot and Node services, capacity planning is conceptually about how each instance behaves under load:

  • There is a practical limit to how many concurrent requests an instance can handle while keeping latency acceptable.
  • You translate business load (RPS) into instances needed, given what a single instance can process.
  • For Java in particular, you think about:
    • Thread pools (to avoid thread starvation)
    • Heap size and GC behavior (to avoid long pauses)

At the conceptual level, your stance is:

“I know that each instance has a sustainable RPS at acceptable latency. I provision enough instances so that even peak RPS keeps them under safe CPU and memory utilization, with capacity to absorb spikes and failures.”

4.3 Python Workers & Asynchronous Jobs

Python workers (or any async workers) are about throughput over time, not instant RPS:

  • How many background tasks will be created during peaks?
  • How quickly must they be processed to meet SLAs?

Conceptually:

  • You treat workers like a production line: more workers = higher throughput.
  • You size worker pools so that backlogs do not grow uncontrollably during peaks.
  • You place them behind a queue (SQS/Kafka) so that
    short bursts are smoothed out over time.

4.4 Databases (RDS, DynamoDB)

Databases are usually the hardest components to scale and the most dangerous bottlenecks.

For relational DBs (RDS/Aurora):

  • You think in terms of queries per second, CPU, memory, and I/O.
  • You use caching and read replicas to protect the primary.
  • You avoid letting chatty or unbounded queries dominate capacity.

For DynamoDB:

  • You think in terms of read/write patterns per access pattern.
  • You ensure keys are well distributed to avoid hot partitions.
  • You keep a comfortable margin over expected read/write rates,
    often starting with on‑demand and later tuning to provisioned.

Conceptually, your principle is:

“Because DBs scale less elastically than stateless services,
I aggressively reduce the amount of traffic that ever reaches them, using caches, read replicas, and event‑driven updates.”

4.5 Caches (Redis / ElastiCache)

Caches are your pressure valves for read load:

  • They sit between services and databases.
  • They serve the hot, frequently requested data at low latency.
  • They dramatically reduce the number of round‑trips to the database.

Conceptually, you:

  • Identify which data is read‑heavy and not changing constantly (catalog, configuration, some user preferences).
  • Design your system so that these reads hit cache most of the time.
  • Treat your cache as a first‑class tier that itself must be highly available and monitored.

The key mental model:

“Every read that is satisfied by cache is a read that does not stress my database.”

4.6 Messaging (SQS / Kafka)

Messaging systems decouple producers (APIs) from consumers (workers and backends).

Conceptually, queues give you:

  • Elasticity in time: producers can spike without requiring consumers to scale instantly.
  • Backpressure: if downstream is slow, the queue absorbs the difference temporarily.
  • Isolation: failures in downstream systems do not directly crash frontends.

In capacity terms, you think about:

  • What are the peak event rates during key scenarios (orders, payments, notifications)?
  • How long can events wait and still meet SLAs?
  • How many consumers are needed to keep the backlog healthy?

4.7 Object Storage (S3)

S3 provides practically unbounded capacity for storage and very high throughput.

Capacity planning is less about “Can S3 handle it?” (usually yes) and more about:

  • How much data accumulates (storage cost, lifecycle policies).
  • How downstream consumers (CDNs, services, analytics) handle the traffic.
  • How you design keys and prefixes to avoid hot partitions.

Conceptually, think of S3 as your long‑term memory:

  • Huge, cheap, reliable.
  • But every consumer of that data still has to be sized properly.

5. AWS Platform Choices

5.1 EC2 vs ECS vs EKS vs Lambda

An architect does not memorize a right answer; they apply a decision framework:

  • EC2 directly:
    • You want maximum control over instances.
    • You accept more operational work.
    • You use Auto Scaling Groups directly.
  • ECS (on EC2/Fargate):
    • You want container orchestration without managing Kubernetes.
    • You have multiple services and want service‑level scaling.
    • Fargate is attractive when you want to focus on tasks, not instances.
  • EKS:
    • You already have strong Kubernetes experience and tooling.
    • You want Kubernetes ecosystem features (CRDs, operators, etc.).
    • You accept more control plane complexity in return for flexibility.
  • Lambda:
    • You have event‑driven, spiky, or low/medium RPS workloads.
    • You are happy with function boundaries and statelessness.
    • You accept limitations (cold starts, execution time limits, certain languages).

Conceptual rule:

Use containerized services (ECS/EKS) for high, steady traffic APIs.
Use Lambda or Fargate for spiky or asynchronous workloads.
Use EC2 directly where you need low‑level control or have legacy constraints.

5.2 Auto Scaling as a Safety Net, Not a Crutch

Auto Scaling (whether for EC2, ECS, or Lambda) is about keeping the system within safe utilization bounds as load fluctuates.

Conceptually, you:

  • Decide your target utilization range (e.g., 50–70% CPU for services).
  • Let auto scaling increase or decrease capacity to stay within that range.
  • Combine reactive scaling (based on metrics) with scheduled scaling for known events.

The key mental point:

Auto scaling does not remove the need for capacity planning;
it just adjusts around the plan. You still must understand what “safe” looks like.


6. Peak Traffic and Failure Scenarios (Thinking in Scenarios)

Capacity planning becomes much more concrete when you think in terms of scenarios, not just steady state.

6.1 Black Friday / Seasonal Peaks

  • Traffic grows significantly but often predictably.
  • The business announces campaigns in advance.
  • You have historical data from previous years.

Conceptually, you:

  • Forecast a multiplicative increase on top of normal peak.
  • Pre‑scale core components (services, DB, caches) ahead of the event.
  • Harden all critical paths (checkout, payment) with aggressive monitoring and rollback.

6.2 Flash Sales and Sudden Spikes

  • Traffic can spike 10–20× within minutes.
  • Auto scaling might not react fast enough by itself.

Conceptually, you:

  • Place queues between frontends and heavy backends.
  • Use rate limiting and virtual waiting rooms to shape incoming load.
  • Pre‑warm capacity when possible (scheduled scaling around marketing pushes).

6.3 Dependency Slowness (e.g., Payment Gateway)

Failure is not always “down”; sometimes it is “slow”. When a payment gateway slows down:

  • Threads pile up waiting on external calls.
  • Latency and timeouts increase.
  • Retries can multiply the traffic going to the same slow dependency.

Conceptually, you:

  • Use timeouts, circuit breakers, and bulkheads to isolate the slow dependency.
  • Use queues and asynchronous flows so that user‑visible endpoints are not tightly coupled to external latency.
  • Accept that during incidents, throughput may be intentionally reduced to protect the system as a whole.

7. Capacity vs Performance vs Scalability

These words are related but distinct:

  • Capacity:
    • How much load the system can sustain while still meeting its performance targets.
    • Example: “We can handle 5,000 RPS at p95 latency of 300 ms.”
  • Performance:
    • The speed and responsiveness of the system at a given load.
    • Example: “At 1,000 RPS, checkout completes in 150 ms p95.”
  • Scalability:
    • How the system’s capacity changes as you add more resources.
    • Example: “If we double our app servers, our max RPS nearly doubles.”

Conceptual view:

  • Performance is the shape of the curve (latency vs load).
  • Capacity is the safe region on that curve where SLAs are met.
  • Scalability is how far you can push that region by adding more resources, and how easily.

Good capacity planning aims to:

  • Place normal and peak operation well inside the safe region.
  • Ensure the system is scalable enough that the safe region can be expanded without redesign.

8. Real‑World Rules of Thumb (Architect Mindset)

Architects often apply heuristics instead of raw formulas:

  1. Always think in peaks and percentiles:
    • Averages hide the real story.
    • Plan for the busiest 5–15 minutes and for p95/p99 latency.
  2. Protect your databases:
    • Cache aggressively.
    • Use read replicas thoughtfully.
    • Avoid ad‑hoc heavy queries on the primary.
  3. Use headroom deliberately:
    • Do not run critical components near saturation.
    • 50–70% CPU as a design target is common for application services.
  4. Use queues to decouple rates:
    • Frontends can accept work faster than backends can process it.
    • Queues make the system more tolerant to temporary overloads.
  5. Prefer graceful degradation over outages:
    • If some components are overloaded, temporarily reduce features (e.g., limit recommendations, tone down personalization) to keep checkout alive.
  6. Measure, then refine:
    • Initial capacity planning is based on assumptions.
    • Once traffic is live, continuously refine based on real data.

9. Common Conceptual Mistakes in Capacity Planning

  1. Thinking only from the infra side:
    • Starting from CPU and memory without understanding user behavior.
    • Fix: always start from business metrics and user journeys.
  2. Ignoring external dependencies:
    • Payment, email, SMS, search, fraud checks.
    • These have their own rate limits and SLAs; they must be part of the plan.
  3. Assuming auto scaling will save you:
    • Auto scaling has reaction time and warm‑up time.
    • It cannot fix fundamental bottlenecks (e.g., a single DB instance at 100% CPU).
  4. Underestimating write and consistency paths:
    • Reads are common, but writes are critical.
    • Checkout and inventory must preserve correctness under high load.
  5. No backpressure strategy:
    • Unlimited queues or unconstrained retries generate storms.
    • Good designs use rate limits, bounded queues, and idempotent operations.
  6. Not rehearsing peak scenarios:
    • Until you run load tests and game days, your capacity plan is mostly a theory.

1. What is capacity planning (in simple terms)?

Capacity planning is answering a simple question in a quantitative way:

“Given our expected traffic and behavior, how much compute, database, cache, network, and storage do we need so the system stays fast and reliable, even at peak load?”

For e‑commerce, it is critical because:

  • Sales & campaigns: Black Friday, flash sales, influencer campaigns can multiply traffic by 5–20×. If you guess wrong, you lose real money.
  • Checkout reliability: Browsing can be slow and users may tolerate it a bit. Checkout failures = direct revenue loss, chargebacks, angry customers.
  • Downstream dependencies (payment, fraud, search, SMS, email) can slow down or fail; your system must have enough buffer (queues, retries, extra capacity) to survive.

Think of it as:

Business goals → traffic/usage numbers → per-component load → required capacity (+ safety margin).


2. From business metrics → technical metrics

You almost always start from business-side numbers and translate.

2.1 Core business metrics

Typical inputs:

  • Total registered users: e.g. 10M
  • DAU / MAU:
    • DAU = daily active users
    • MAU = monthly active users
  • Peak concurrent users:
    • Users simultaneously active during peak minute
  • Orders per day and conversion rate:
    • Conversion rate = orders per day / visitors per day

Plus, you estimate user behavior:

  • Average sessions per user per day
  • Average page views per session
  • Average API calls per page view

2.2 Mapping to requests per second (RPS)

Basic formula:

  • Total requests/day = DAU × requests per user per day
  • Average RPS = Total requests/day ÷ 86,400
  • Peak RPS = Average RPS × Peak factor

Peak factor (rule-of-thumb):

  • 3–5× for normal day traffic
  • 10–20× for big sale events

Similarly for orders:

  • Orders/day = DAU × conversion rate
  • Average orders/s = Orders/day ÷ 86,400
  • Peak orders/s ≈ Average orders/s × Peak factor
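
A minimal Python sketch of this translation, using purely illustrative inputs (the DAU, requests per user, conversion rate, and peak factor below are assumptions, not recommendations):

```python
# Assumed business inputs (illustrative only)
dau = 1_000_000            # daily active users
requests_per_user = 20     # API calls per user per day
conversion_rate = 0.05     # fraction of DAU that places an order
peak_factor = 8            # peak-to-average multiplier

seconds_per_day = 86_400

avg_rps = dau * requests_per_user / seconds_per_day
peak_rps = avg_rps * peak_factor

orders_per_day = dau * conversion_rate
avg_orders_s = orders_per_day / seconds_per_day
peak_orders_s = avg_orders_s * peak_factor

print(f"avg RPS ~ {avg_rps:.0f}, peak RPS ~ {peak_rps:.0f}")
print(f"orders/day = {orders_per_day:.0f}, peak orders/s ~ {peak_orders_s:.1f}")
# avg RPS ~ 231, peak RPS ~ 1852; orders/day = 50000, peak orders/s ~ 4.6
```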

Then you break down RPS by journey:

  • Browse / home / category
  • Search
  • Product details page (PDP)
  • Cart
  • Checkout (address, shipping, payment, place order)

3. Step‑by‑step calculations

3.1 Start with user journeys

Define a simple traffic model (numbers are assumptions to reason with, not “truth”):

  • Average requests per page view: 3 API calls (layout, recommendations, inventory, etc.)
  • Typical session breakdown:
    • 5 category/browse page views
    • 3 search result page views
    • 4 PDPs
    • 1 cart page
    • 1 checkout flow (3 steps: shipping, payment, place‑order) for converting users

From this you can derive per‑journey RPS.

Example structure

Assume peak total API traffic is 2,000 RPS. Split:

Journey      Share of traffic     RPS
Browse       40%                  800
Search       20%                  400
PDP          20%                  400
Cart         10%                  200
Checkout     10%                  200

You can refine by service:

  • catalog-service – handles browse + PDP
  • search-service – handles search
  • cart-service – handles cart + part of checkout
  • order-service – handles place‑order, order status
  • payment-service – orchestrates payment gateway

3.2 API traffic per service

Let’s say:

  • Browse and PDP hit catalog-service
  • Cart and checkout hit cart-service and order-service
  • Search hits search-service

Then:

  • catalog-service RPS = browse (800) + PDP (400) = 1,200 RPS
  • search-service RPS = 400 RPS
  • cart-service RPS = cart (200) + part of checkout (say 100) = 300 RPS
  • order-service RPS = place‑order + order reads (say 100 RPS)

General formula:

RPS(service) = Σ [RPS(journey) × fraction_of_journey_hitting_this_service]

You repeat that for each microservice.
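
A small sketch of that aggregation, using the illustrative journey split above (the routing fractions, including the 50/50 split of checkout between cart and order, are assumptions):

```python
# Peak RPS per journey (illustrative split of 2,000 total RPS)
journey_rps = {"browse": 800, "search": 400, "pdp": 400, "cart": 200, "checkout": 200}

# Fraction of each journey's traffic that lands on each service (assumed routing)
routing = {
    "catalog-service": {"browse": 1.0, "pdp": 1.0},
    "search-service":  {"search": 1.0},
    "cart-service":    {"cart": 1.0, "checkout": 0.5},
    "order-service":   {"checkout": 0.5},
}

service_rps = {
    svc: sum(journey_rps[j] * frac for j, frac in journeys.items())
    for svc, journeys in routing.items()
}
print(service_rps)
# {'catalog-service': 1200.0, 'search-service': 400.0, 'cart-service': 300.0, 'order-service': 100.0}
```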


3.3 Database reads vs writes

For each API call, estimate:

  • Reads per request
  • Writes per request
  • How many reads hit cache vs DB

Define:

  • R_total = total logical reads/s the app needs
  • W_total = total writes/s the app needs
  • Cache hit ratio H (0–1)

Then:

  • DB reads/s: R_db = R_total × (1 − H)
  • DB writes/s: W_db = W_total (writes nearly always hit DB)

Example (catalog)

Assume at peak:

  • catalog-service = 1,200 RPS
  • Each request:
    • needs 3 key lookups (e.g., product data, price, inventory) → 3 logical reads
  • So: R_total = 1,200 × 3 = 3,600 reads/s
  • Cache hit ratio (Redis) = 90% → H = 0.9

Then:

  • DB reads/s = 3,600 × (1 − 0.9) = 360 reads/s
  • DB writes/s (e.g., for analytics) maybe 10 writes/s

That’s a huge reduction in DB load due to caching.
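
The same arithmetic as a sketch (the request rate, reads per request, and hit ratio are the assumptions from this example):

```python
service_rps = 1_200          # peak RPS hitting catalog-service (assumed)
reads_per_request = 3        # logical key lookups per request (assumed)
cache_hit_ratio = 0.9        # assumed Redis hit ratio

total_reads = service_rps * reads_per_request       # 3,600 logical reads/s
db_reads = total_reads * (1 - cache_hit_ratio)      # 360 reads/s actually reach the DB
print(f"logical reads/s = {total_reads:.0f}, DB reads/s = {db_reads:.0f}")
```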


3.4 Cache hit ratios (and capacity)

Key ideas:

  • Cache all read‑heavy, infrequently changing data: product details, category trees, configurations.
  • Aim for 80–95% hit ratio on hot paths (catalog, PDP).
  • Use TTL + write‑through / write‑behind patterns as needed.

Impact on capacity:

  • Redis throughput ≈ hundreds of thousands of operations/s on modest hardware.
  • Key formula:

Cache QPS = Σ [request_RPS × cache_lookups_per_request]

Then check:

  • If cache_QPS is 50k ops/s and 1 Redis node can handle 200k ops/s at desired latency, you’re at 25% utilization → plenty of headroom.
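
A quick utilization check along those lines (the service names, lookup counts, per-node capacity, and target utilization are illustrative assumptions):

```python
import math

request_rps = {"catalog": 1_200, "cart": 200}            # peak RPS per service (assumed)
cache_lookups_per_request = {"catalog": 3, "cart": 2}    # assumed lookups per request
node_capacity_ops = 200_000                              # ops/s one node sustains at target latency (assumed)
target_utilization = 0.5                                  # keep 50% headroom

cache_qps = sum(request_rps[s] * cache_lookups_per_request[s] for s in request_rps)
shards = math.ceil(cache_qps / (node_capacity_ops * target_utilization))
print(f"cache QPS = {cache_qps}, shards needed = {shards}")
# cache QPS = 4000, shards needed = 1 -> plenty of headroom on a single shard
```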

3.5 Message queues (order events, payments, etc.)

Queue capacity planning is about:

  1. Throughput – msgs/s you need to push through
  2. Backlog – how many messages can be temporarily queued during spikes or downstream slowness
  3. Consumer capacity – how fast workers process messages

Basic formulas:

  • Avg msgs/s:
    msgs/s = msgs/day ÷ 86,400
  • If each order produces E events:
    • order-created, order-paid, order-shipped, etc.
  • Then:
    • order_events/day = orders/day × E
    • order_events/s (avg) = order_events/day / 86400

Consumer throughput:

  • Per worker: throughput_worker = 1 / (avg processing time per msg)
  • Total consumer throughput: throughput_total = workers × throughput_worker

To avoid backlog growth:

throughput_total ≥ peak msgs/s.

If you want a buffering queue (e.g., for spiky payments), you allow temporary backlog and size the queue (and DLQ) accordingly.
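
A small sketch tying these formulas together (the event count per order, peak factor, and processing time are assumptions consistent with the examples used in this document):

```python
import math

orders_per_day = 50_000
events_per_order = 5                 # e.g., created, paid, packed, shipped, notified (assumed)
peak_factor = 8

avg_msgs_s = orders_per_day * events_per_order / 86_400    # ~2.9 msgs/s
peak_msgs_s = avg_msgs_s * peak_factor                      # ~23 msgs/s

processing_time_s = 0.05             # assumed 50 ms per message per worker
throughput_per_worker = 1 / processing_time_s               # 20 msgs/s

workers_needed = math.ceil(peak_msgs_s / throughput_per_worker)
print(f"peak {peak_msgs_s:.0f} msgs/s -> at least {workers_needed} worker(s), plus headroom/HA")
```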


4. Component‑wise capacity planning

4.1 API Gateway / Load Balancer

AWS components: API Gateway (for public APIs), ALB/NLB in front of services.

Key metrics:

  • RPS
  • Active connections
  • TLS handshakes (for HTTPS)
  • Per‑LB limits (LCUs, etc.) – ensure peak fits within AWS limits

For ALB:

  • LCUs (Load Balancer Capacity Units) are based on:
    • New connections/s
    • Active connections
    • Processed bytes/s
    • Rule evaluations/s

Rough approach:

  1. Estimate peak RPS and average payload size.
  2. Compute peak data throughput:
    • e.g., 2,000 RPS × 50 KB = 100,000 KB/s ≈ 100 MB/s.
  3. Check against AWS ALB LCU thresholds (from docs) and add headroom.

You usually over‑provision LB capacity; cost is small relative to the application.


4.2 Java & Node services (CPU, memory, threads, GC)

For each service (Spring Boot, Node.js):

  1. Benchmark or load test one instance:
    • Example: m6i.large (2 vCPU, 8 GB)
    • It can handle ~150 RPS at p95 < 200 ms with 70% CPU.
  2. Decide your target utilization:
    • 50–70% CPU under peak to keep latency headroom.
  3. Compute required instance count:

Instances = ⌈ Peak RPS / (RPS per instance at target utilization) ⌉

Example:

  • Peak RPS for catalog-service: 1,200
  • One instance handles 150 RPS at 70% CPU
  • Required instances:
      • ⌈1,200 / 150⌉ = ⌈8⌉ = 8
  • Add N+1 redundancy and zone failure headroom → say min 10 instances.
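
The same calculation as a small reusable helper (the per-instance RPS comes from the benchmark above; the redundancy and headroom policy is an assumption, not a standard formula):

```python
import math

def instances_needed(peak_rps: float, rps_per_instance: float,
                     redundancy: int = 1, headroom: float = 1.0) -> int:
    """Instances required to serve peak_rps at the benchmarked per-instance rate,
    with an optional multiplicative headroom factor and N+redundancy spares."""
    return math.ceil(peak_rps * headroom / rps_per_instance) + redundancy

print(instances_needed(1_200, 150))                 # 9  (8 computed + 1 spare)
print(instances_needed(1_200, 150, redundancy=2))   # 10, in line with the "min 10 instances" above
```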

Threads & GC (Java):

  • Use bounded thread pools; avoid unbounded @Async pools.
  • For blocking IO (DB, HTTP clients), threads can be ~2–4× CPU cores.
  • GC:
    • Prefer G1 or ZGC for large heaps.
    • Keep heap size such that GC pauses are < 100–200 ms.
    • Monitor young/old gen promotion rates under load.

Node.js:

  • Single‑threaded event loop; parallelism via multiple processes (cluster) or containers.
  • Capacity is more about RPS per process and the number of processes per node.

4.3 Python workers & async jobs

Typical for:

  • Order processing
  • Notifications (email, SMS)
  • ML inference
  • Batch jobs

Key parameters:

  • msg_processing_time (seconds/message)
  • msgs_per_worker = 1 / msg_processing_time
  • total_throughput = workers × msgs_per_worker

Example:

  • Each worker processes a message in 50 ms → 20 msgs/s.
  • Peak order events: 100 msgs/s.
  • Workers needed: 100 / 20 = 5
  • Add headroom: run 8–10 workers.

You then size:

  • ECS tasks / K8s pods / Lambda concurrency based on that.

4.4 Databases: RDS & DynamoDB

4.4.1 RDS (Aurora / MySQL / Postgres)

Capacity variables:

  • vCPU, RAM
  • IOPS (storage)
  • Max connections
  • Read replicas (for read scaling)

Steps:

  1. Convert RPS to DB QPS:
    • QPS = reads/s + writes/s.
  2. Load test to find:
    • Max QPS at acceptable latency (p95 < X ms) and CPU < 70%.
  3. Pick instance class:
    • Start with e.g. r6g.large for memory‑heavy, m6i.large for balanced.
  4. For reads:
    • Offload via read replicas and caching.
    • If primary can handle 500 QPS, you might add 2 replicas for total ~1500 QPS.
  5. For writes:
    • Harder to scale; optimize schema, batching, and use queues where possible.

Use AWS RDS best practices:

  • Monitor CPU, memory, I/O, and query latency.
  • Use vertical scaling plus read replicas as needed.

4.4.2 DynamoDB

Key unit: RCU (read capacity unit) and WCU (write capacity unit).

Simplified:

  • 1 RCU ≈ 1 strongly consistent read/s for item ≤ 4 KB (2 RCUs for 4–8 KB, etc.).
  • 1 WCU ≈ 1 write/s for item ≤ 1 KB.

So:

  1. Compute read/write requests per second.
  2. Normalize by item size.

Example:

  • order items size: 2 KB → 1 strongly consistent read = 1 RCU (since ≤ 4 KB).
  • Peak:
    • 500 reads/s, 50 writes/s.
  • Needed:
    • RCUs ≈ 500 (plus headroom, say ×1.4 → 700 RCUs).
    • WCUs: a 2 KB item costs 2 WCUs per write → 50 × 2 = 100 (plus headroom → ~140 WCUs).

Best practice:

  • Start with on‑demand for new workloads; switch to provisioned with auto‑scaling when patterns stabilize and target ~70% utilization for headroom.
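
A sketch of that RCU/WCU arithmetic under the simplified rules above (item sizes, request rates, and the 1.4× headroom are assumptions; check the current DynamoDB documentation for exact rounding rules):

```python
import math

def rcus(reads_per_s: float, item_kb: float, strongly_consistent: bool = True) -> float:
    units = math.ceil(item_kb / 4)                       # 1 read unit per 4 KB
    per_read = units if strongly_consistent else units / 2
    return reads_per_s * per_read

def wcus(writes_per_s: float, item_kb: float) -> float:
    return writes_per_s * math.ceil(item_kb / 1)         # 1 write unit per 1 KB

headroom = 1.4
print(rcus(500, 2) * headroom)   # 700.0 RCUs, matching the example above
print(wcus(50, 2) * headroom)    # 140.0 WCUs for 2 KB items
```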

4.5 Caches (Redis / ElastiCache)

Parameters:

  • QPS (get/set ops/s)
  • Dataset size (hot data)
  • Latency target (e.g., < 1–2 ms in‑memory)
  • Replication (for HA) and sharding.

Steps:

  1. Compute cache QPS:
    • Example: 15k ops/s total at peak.
  2. Estimate memory:
    • Number of keys × avg bytes per key (value + overhead).
  3. Choose node type:
    • E.g., cache.r6g.large (13.68 GiB of memory).
  4. Check vendor guidance:
    • A single node can handle hundreds of thousands of ops/s with small values.

Rule‑of‑thumb:

  • Use clustered Redis if:
    • Dataset > 50–60% of node RAM, or
    • QPS > what a single node can safely handle with headroom.

4.6 Messaging (SQS / Kafka)

SQS:

  • Extremely high throughput; usually not the bottleneck.
  • Single queue supports thousands of msgs/s easily.
  • Capacity planning focuses on consumer side.

Kafka:

  • Throughput per partition.
  • Rule‑of‑thumb:
    • Start with 3 partitions per topic.
    • Increase partitions as throughput grows (e.g., 10+ partitions for high throughput).
  • Consumer group parallelism = number of partitions.

Compute:

  • bytes/s = msgs/s × avg msg size.
  • Ensure brokers’ disk/network can handle this.

4.7 Object storage (S3)

Capacity planning is mostly about:

  • Storage size:
    • Number of objects × avg size.
  • Read/write rate:
    • RPS and bytes/s.

S3 scales virtually “infinitely”, but:

  • Avoid hot partitions by using good key distribution.
  • Ensure downstream consumers (e.g., CDN, app servers) and network links can handle throughput.

5. AWS sizing decisions

5.1 EC2 vs ECS vs EKS vs Lambda

Think in terms of baseline load, traffic pattern, and team maturity.

EC2:

  • You manage instances directly.
  • Good when:
    • You need fine‑grained OS control.
    • Simpler operations or legacy apps.
  • Capacity planning: instances + Auto Scaling Groups.

ECS (on EC2 or Fargate):

  • Container orchestration without running Kubernetes.
  • Good default for microservices if you prefer containers.
  • Use Service Auto Scaling (task count) and Capacity Provider (EC2 capacity).

EKS:

  • Full Kubernetes; more flexibility, but more ops complexity.
  • Good when you already have strong K8s expertise and tooling.

Lambda:

  • Event‑driven, serverless.
  • Great for:
    • Spiky, unpredictable loads.
    • Background tasks, low to medium RPS APIs.
  • Capacity planning in terms of:
    • Concurrency (functions running at the same time).
    • Duration.
  • Be careful with:
    • Cold starts (especially for Java).
    • Downstream DB connection limits.

Rule‑of‑thumb:

  • Steady high RPS core APIs → ECS/EKS on EC2 or Fargate.
  • Spiky/background/event flows → Lambda or ECS Fargate.

5.2 Auto Scaling Group (ASG) calculations

Classic pattern:

  1. Determine per‑instance capacity:
    • e.g., one m6i.large handles 150 RPS at 70% CPU.
  2. Determine peak RPS with headroom (e.g. 1.5× expected).
  3. Target utilization (CPU, RPS) = 50–70%.

Formula:

Desired instances at peak = ⌈ Peak RPS / (RPS per instance at target utilization) ⌉

Then:

  • Set:
    • min_capacity = enough for normal traffic and HA (e.g., 3–5 instances).
    • max_capacity = enough for worst‑case spikes + zone failure.

You may also:

  • Use scheduled scaling to pre‑warm for known events (Black Friday, campaigns).
  • Use predictive scaling based on historical metrics.
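
A rough sizing helper along these lines (the headroom multiplier, the normal-traffic estimate, and the one-AZ-loss rule below are assumed policies for illustration, not AWS formulas):

```python
import math

def asg_sizes(peak_rps: float, rps_per_instance: float,
              headroom: float = 1.5, azs: int = 3, normal_rps=None):
    """Rough ASG sizing: desired capacity at peak with headroom, min for normal traffic
    plus HA, max with room to keep serving peak after losing one AZ."""
    desired = math.ceil(peak_rps * headroom / rps_per_instance)
    normal = normal_rps if normal_rps is not None else peak_rps / 4   # assumed normal load
    minimum = max(azs, math.ceil(normal / rps_per_instance) + 1)
    maximum = math.ceil(desired * azs / (azs - 1))                    # one-AZ-loss policy (assumed)
    return minimum, desired, maximum

print(asg_sizes(peak_rps=1_200, rps_per_instance=150))   # (3, 12, 18)
```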

5.3 Read/Write capacity units (DynamoDB)

For DynamoDB (provisioned capacity):

  1. Estimate:
    • reads_per_second
    • writes_per_second
    • item_size_kb
  2. Compute:
    • RCUs = reads_per_second × RCU_per_read
    • WCUs = writes_per_second × WCU_per_write

Where:

  • RCU_per_read:
    • Strongly consistent: 1 RCU for ≤ 4 KB.
    • Eventually consistent: 0.5 RCU for ≤ 4 KB.
  • WCU_per_write:
    • 1 WCU for ≤ 1 KB, 2 WCUs for 1–2 KB, etc.

Add ~30–50% headroom or target ~70% utilization.


5.4 Network bandwidth considerations

Compute network usage:

  • bytes/s = RPS × avg payload (request + response).
  • Convert to Mb/s:
    • Mb/s = (bytes/s × 8) / 1,000,000.

Example:

  • 2,000 RPS × 10 KB avg response = 20,000,000 bytes/s ≈ 160 Mb/s.

Check:

  • Per‑instance ENI bandwidth (instance type limits).
  • Cross‑AZ and cross‑region traffic costs.
  • LB throughput.

The practical constraint is often NIC limits on EC2 instances or egress cost, not S3 or SQS.
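
The conversion as a tiny sketch (the 10 KB average payload is the assumption from the example above):

```python
rps = 2_000
avg_payload_bytes = 10_000              # ~10 KB request + response (assumed)

bytes_per_s = rps * avg_payload_bytes   # 20,000,000 bytes/s
mbit_per_s = bytes_per_s * 8 / 1_000_000
print(f"{mbit_per_s:.0f} Mb/s")         # 160 Mb/s, as in the example above
```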


6. Peak traffic scenarios

6.1 Black Friday (predictable big peak)

Characteristics:

  • Traffic ramp‑up over hours; sustained high for many hours.
  • High browse/search, high conversion.

Strategies:

  • Use historical data to forecast traffic multiplier.
  • Pre‑scale ASGs, RDS instances, and DynamoDB capacity (if provisioned).
  • Increase cache size and TTLs for catalog.
  • Ensure read replicas and failover strategies are ready.
  • Run load tests at 2–3× expected peak.

6.2 Flash sales (short, steep spikes)

Characteristics:

  • Traffic jumps 10–20× within minutes.
  • Short‑lived (15–60 minutes).

Challenges:

  • Auto scaling may lag (scale‑out takes a few minutes).
  • DB and payment can be overwhelmed briefly.

Strategies:

  • Pre‑warm capacity just before sale (scheduled scaling).
  • Use queues (SQS/Kafka) for:
    • Order creation: accept a lightweight “ticket” synchronously and do the rest via async processing.
  • Implement rate limiting and virtual queues on frontend to spread load.

6.3 Payment gateway slowness

Symptoms:

  • Higher latency on payment API calls.
  • Increased in‑flight checkout requests.
  • Threads saturated waiting on external calls → local CPU may be low, but thread pools are full.

Effects on capacity:

  • Effective throughput per instance is capped by how many requests it can hold in flight concurrently.
  • Queue lengths grow; timeouts and retries amplify load.

Mitigations:

  • Strict timeouts, circuit breakers, and bulkheads (separate thread pools).
  • Asynchronous payment flow:
    • Place order → enqueue payment → return “pending payment” → notify.
  • Over‑provision workers for payment callback/notification flows.
  • Protect DB from retry storms (idempotency keys).

7. Capacity vs performance vs scalability

These three are related but not the same:

  • Capacity: “How much can the system handle today with current resources?”
    • e.g., can handle 2,000 RPS at p95 < 300 ms.
  • Performance: “How fast does it respond under a given load?”
    • Latency, throughput, error rate at a particular RPS.
  • Scalability: “How does capacity change when we add resources?”
    • Linear: doubling instances ≈ doubles RPS.
    • Sub‑linear: due to DB bottlenecks, locks, or contention.

Text diagram:

Traffic ↑
        │       ┌──────────── capacity limit (current infra)
        │      /
Latency │     /
        │    /
        └───┴───────────────────→ RPS
           performance curve

Scaling: shift the curve to the right by adding capacity.

Capacity planning is about picking a point on the performance curve with enough headroom, and ensuring you can shift the curve (scale) quickly when needed.


8. Real‑world rules of thumb (architect mindset)

Some commonly used heuristics:

  1. Headroom:
    • Plan for 1.5–2× your expected max peak RPS.
    • Run at 50–70% CPU under expected peak for core services.
    • Keep DB CPU < 60–70% under peak.
  2. Cache aggressively:
    • Aim for 80–95% cache hit ratio on catalog and PDP.
    • If DB CPU is spiky, first check cache effectiveness before scaling DB.
  3. Databases are the hardest to scale:
    • Scale compute tier horizontally.
    • Offload read‑heavy use cases to cache and read replicas.
    • Use CQRS and event‑driven patterns for write‑heavy flows.
  4. Queues protect downstream systems:
    • Use SQS/Kafka between “fast” frontend and “slow” backends.
    • Design workers to be idempotent so you can safely retry.
  5. Start simple, then optimize:
    • Use on‑demand DB capacity and auto scaling at the beginning.
    • After understanding patterns, tune to provisioned and reserved capacity for cost.
  6. Test with realistic scenarios:
    • Load test end‑to‑end with:
      • Black Friday‑like traffic
      • Flash sales
      • Downstream slowness (payment, email, etc.)

9. Common mistakes in capacity planning

  1. Using only average numbers:
    • Average RPS is useless; peaks kill you.
    • Always think in terms of 95th/99th percentile and burst traffic.
  2. Ignoring external dependencies:
    • Payment, fraud, SMS, email, search.
    • If they slow down, your threads/queues fill up → cascaded failures.
  3. No safety margin:
    • Running everything at 80–90% CPU “because it’s efficient”.
    • Leaves no room for spikes or noisy neighbors.
  4. Over‑trusting auto scaling:
    • Auto scaling reacts after metrics cross thresholds.
    • Scale‑out takes minutes; spikes can come in seconds.
  5. Not capacity‑planning the database:
    • Planning only EC2/ECS pods but neglecting RDS/DynamoDB limits.
    • DB becomes the bottleneck; scaling app servers only makes it worse.
  6. No load testing:
    • Relying on intuition instead of numbers.
    • Missing hidden locks, slow queries, thread pool starvation.
  7. Ignoring backpressure:
    • No rate limiting.
    • Unlimited queues leading to massive backlog and timeouts.
  8. One giant cache:
    • No distinction between hot and cold data.
    • Eviction storms causing DB spikes.

10. Worked example: “10M users, 1M DAU, 5% conversion”

Let’s go step‑by‑step and design end‑to‑end capacity.
Assumptions are for reasoning, not production values.

10.1 Business inputs

  • Total registered users: 10M
  • DAU: 1M
  • Conversion rate: 5%
    • → Orders/day = 1M × 5% = 50,000 orders/day

Assume:

  • Requests per user per day (API calls): 20
    • “Light” users: 10–15
    • “Heavy” users: 50+
  • Peak factor: 8× (for a busy sale day, but not Black Friday‑max).

10.2 Compute total API requests & RPS

Total requests per day:

  • requests_per_day = DAU × requests_per_user
    = 1,000,000 × 20 = 20,000,000 requests/day

Average RPS:

  • avg_RPS = 20,000,000 / 86,400 ≈ 231 RPS

Peak RPS:

  • peak_RPS = avg_RPS × peak_factor = 231 × 8 ≈ 1,848 RPS
  • Round to 2,000 RPS for safety.

10.3 Orders per second

Orders/day = 50,000.

Average orders/s:

  • avg_orders_s = 50,000 / 86,400 ≈ 0.58 orders/s

Peak orders/s (with factor 8):

  • peak_orders_s ≈ 0.58 × 8 ≈ 4.6 → roughly 5 orders/s

For more conservative planning, use 10 orders/s peak.

10.4 Traffic by journey & service

Assume 2,000 peak RPS split:

Journey      % of RPS    RPS    Main service(s)
Browse       40%         800    catalog-service
Search       20%         400    search-service
PDP          20%         400    catalog-service
Cart         10%         200    cart-service
Checkout     10%         200    order-service, cart-service

Service RPS:

  • catalog-service = 800 + 400 = 1,200 RPS
  • search-service = 400 RPS
  • cart-service = 200 RPS
  • order-service = main part of checkout: say 100 RPS for writes (place order) and 100 RPS for reads (order status etc.)

10.5 Capacity for Java/Node services (ECS on EC2)

Assume per instance benchmark:

  • Type: m6i.large (2 vCPU, 8 GB)
  • At 70% CPU:
    • catalog-service: ~150 RPS
    • search-service: ~120 RPS
    • cart-service: ~120 RPS
    • order-service: ~100 RPS (heavier)

Compute instance counts:

  1. catalog-service (Spring Boot):
    • Peak RPS = 1,200
    • Capacity per instance = 150 RPS
    • Instances required:
      • = ceil(1,200 / 150) = ceil(8) = 8
    • Add HA + headroom → run 10 instances.
  2. search-service (Node):
    • Peak RPS = 400
    • Capacity per instance = 120 RPS
    • Instances:
      • = ceil(400 / 120) = ceil(3.33) = 4
    • Add headroom → 5 instances.
  3. cart-service (Spring Boot):
    • Peak RPS = 200
    • Capacity per instance = 120 RPS
    • Instances:
      • = ceil(200 / 120) = ceil(1.67) = 2
    • Add redundancy → 3 instances.
  4. order-service (Spring Boot):
    • Peak RPS = 200 total (100 writes + 100 reads)
    • Capacity per instance = 100 RPS
    • Instances:
      • = ceil(200 / 100) = 2
    • But this is critical; run 4–5 instances for reliability and failure zones.

So for these four core services, total: ~22–23 ECS tasks across 3 AZs.

Auto Scaling:

  • ASG min size: enough for normal traffic (e.g., 2–3 instances per service).
  • Max size: at least 2× peak requirement for flash sale + failure scenarios.

10.6 DB & cache for catalog

Assume:

  • catalog-service = 1,200 RPS.
  • Each request needs 3 product/price/inventory lookups → 3 logical reads.
  • Cache first (Redis/ElastiCache), then DB (Aurora/Postgres).

Total logical reads/s:

  • R_total = 1,200 × 3 = 3,600 reads/s.

Assume cache hit ratio H = 0.9:

  • DB_reads = R_total × (1 - H) = 3,600 × 0.1 = 360 reads/s.

Writes (e.g., analytics, stock changes):

  • Say 20 writes/s at peak.

Aurora:

  • Load test: db.r6g.large supports, say, 1,000 simple queries/s at 60% CPU.
  • Our 380 QPS is well within that.
  • So:
    • Aurora primary: db.r6g.large or db.r6g.xlarge (for more headroom).
    • Add 1–2 read replicas if needed for analytics/reporting.

Cache:

  • Cache QPS = 3,600 (total reads/s).
  • A single cache.r6g.large can handle >100k ops/s.
  • Start with 1 primary + 1 replica.
  • Cluster later if needed.

10.7 Orders DB (DynamoDB example)

Orders/day = 50,000. Peak orders/s = roughly 10.

But besides creation, consider:

  • Order fetches by user (PDP order widget, order history)
  • Payment status updates
  • Shipment updates

Assume:

  • Peak reads: 200 reads/s
  • Peak writes: 50 writes/s
  • Item size: 2 KB (within 4 KB read tier; 1–2 KB write tier)

RCUs:

  • Strongly consistent, 2 KB item → 1 RCU/read.
  • RCUs = 200 × 1 = 200
  • Add headroom (×1.5) → 300 RCUs.

WCUs:

  • 2 KB item → 2 WCUs/write.
  • WCUs = 50 × 2 = 100
  • Headroom (×1.5) → 150 WCUs.

Start with on‑demand until patterns stabilize, then move to provisioned with auto scaling targeting ~70% utilization.


10.8 Message queues (order events)

Assume:

  • Each order generates E = 5 events:
    • order-created, payment-initiated, payment-success/failure, order-packed, order-shipped
  • Orders/day = 50,000

Total events/day:

  • = 50,000 × 5 = 250,000 events/day

Avg events/s:

  • = 250,000 / 86,400 ≈ 2.9 events/s

Peak (×8):

  • ≈ 2.9 × 8 ≈ 23 events/s
  • Round to 30 events/s.

SQS capacity:

  • 30 msgs/s is trivial for SQS; the capacity focus is on the consumer side.

Assume Python worker:

  • Each event processing time = 50 ms
    • 1 worker → 20 msgs/s
  • For 30 msgs/s, need:
    • = 30 / 20 = 1.5 → 2 workers, plus HA → run 4–5 workers.

10.9 Payment gateway slowness scenario (same system)

Assume:

  • Normal payment API latency: 500 ms.
  • Under incident: 3,000 ms (3 sec).

If checkout calls are synchronous:

  • order-service has, say, 200 RPS on checkout.
  • Each payment call ties up:
    • 1 HTTP client connection
    • 1 app thread

Max concurrent payment calls per instance:

  • If thread pool size for external calls = 50.
  • With 3 sec latency, concurrency saturates quickly:
    • throughput_per_instance = pool_size / latency = 50 / 3 ≈ 17 RPS.
  • Under 200 RPS across, say, 4 instances:
    • Per instance load ≈ 50 RPS → far above 17 RPS capacity.
    • Threads fill, timeouts & errors increase, queue length blows up.
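
A sketch of that saturation math (the pool size, latencies, and instance count are the assumptions above; the relationship is essentially Little's Law, throughput ≈ concurrency ÷ latency):

```python
pool_size = 50                 # threads available for outbound payment calls (assumed)
normal_latency_s = 0.5
incident_latency_s = 3.0

def max_rps_per_instance(pool: int, latency_s: float) -> float:
    # Little's Law: concurrency = throughput x latency  =>  throughput = concurrency / latency
    return pool / latency_s

instances = 4
checkout_rps = 200
per_instance_load = checkout_rps / instances                 # 50 RPS each

print(max_rps_per_instance(pool_size, normal_latency_s))     # 100 RPS -> comfortable at 50 RPS
print(max_rps_per_instance(pool_size, incident_latency_s))   # ~17 RPS -> saturated at 50 RPS
```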

Mitigation in capacity terms:

  • Either:
    • Increase pool size (careful with DB & CPU).
    • Or decouple payment via asynchronous flow and queues.
  • And pre‑provision extra instances ahead of windows where payment load or latency is expected to be high.

10.10 End‑to‑end capacity summary (for the example)

For “10M users, 1M DAU, 5% conversion” and assumptions above:

  • API tier:
    • Peak RPS: ~2,000
    • Catalog: 10 instances (m6i.large)
    • Search: 5 instances
    • Cart: 3 instances
    • Order: 4–5 instances
    • All on ECS with ASGs sized for ~2× peak.
  • DB & cache:
    • Catalog DB: Aurora db.r6g.large/xlarge with 1–2 read replicas.
    • Orders: DynamoDB table with ~300 RCUs and 150 WCUs (or on‑demand).
    • Redis: 1 primary + 1 replica (cache.r6g.large) for catalog and session caching.
  • Queues & workers:
    • SQS for order events: 1 queue, 4–5 Python workers (ECS Fargate or Lambda).
    • Payment event queue similarly sized.
  • LB/API Gateway:
    • 1 ALB handling 2,000 RPS (well under LCU limits), plus API Gateway for public APIs as needed.
  • Network & S3:
    • 2,000 RPS × 10 KB = ~160 Mb/s from app tier → easily within typical ENI + ALB limits on chosen instance types.
    • S3 for images, logs; capacity essentially not a concern, just costs and key design.
  • Headroom & resilience:
    • All numbers include at least ~1.5× headroom over expected load.
    • Capacity planning assumes 3 AZs, with enough instances so that losing 1 AZ still keeps service at acceptable performance.

How to present this in interviews

When asked “How do you do capacity planning for a large‑scale e‑commerce system?”:

  1. Start from business metrics:
    • “I’d ask for DAU/MAU, expected peak factor (e.g., Black Friday), conversion rate, and typical user journeys.”
  2. Translate to RPS and orders/s with simple formulas.
  3. Break down by journeys and services:
    • Show how 2,000 RPS maps to catalog, search, cart, and order services.
  4. Show DB and cache math:
    • Reads/writes per request, cache hit ratio, resulting DB QPS.
  5. Discuss per‑component capacity:
    • API Gateway/LB, ECS services, DB (RDS/Dynamo), cache, queues, S3.
  6. Show AWS‑specific decisions:
    • EC2/ECS/EKS/Lambda choice.
    • Auto Scaling Group formulas.
    • DynamoDB RCUs/WCUs.
  7. Address peak scenarios and failure modes:
    • Black Friday, flash sale, payment slowness.
  8. Call out rules‑of‑thumb and mistakes:
    • Headroom, DB bottlenecks, auto scaling lag, impact of external dependencies.
