Capacity Planning
Agentium includes a built-in capacity planning library for modeling LLM inference infrastructure. It answers questions like:
- How many GPUs do I need for N concurrent users?
- What happens to latency when I add NAND SSD offloading?
- What’s the KV cache pressure for my workload mix?
- Where is the TTFT SLA breach point?
| Tier | What | Where |
|---|---|---|
| Tier 1 | Pure-math capacity library | @agentium/core — zero dependencies |
| Tier 2 | Runtime session profiling | @agentium/core + @agentium/observability |
| Tier 3 | Interactive dashboard app | apps/capacity-planner/ (Next.js) |
Quick Start
Glossary
Every term used in the capacity planning system, explained in detail.
KV Cache (Key-Value Cache)
During autoregressive generation, each transformer layer computes Key and Value projections for every token. Without caching, generating token N would require recomputing K and V for all N-1 prior tokens, a quadratic cost per step. The KV cache stores these projections so each decode step only reads them, reducing the cost to linear per step. The KV cache is the single largest consumer of GPU memory during inference. For Llama 3.1 70B at 128K context, the KV cache alone is ~40 GB in bf16, larger than many GPUs.
KV Bytes Per Token
The memory required to store one token’s KV cache entry across all layers:
kv_bytes_per_token = 2 × layers × kv_heads × head_dim × precision_bytes
The factor of 2 accounts for both the Key tensor and the Value tensor. Each layer has its own independent set of K and V vectors, and each KV head stores a vector of head_dim floating-point values.
Example — Llama 3.1 70B in bf16:
2 × 80 layers × 8 kv_heads × 128 head_dim × 2 bytes = 327,680 bytes (~320 KB per token)
This means a single 128K-context session (131,072 tokens) consumes 131,072 × 320 KB = 40 GB of KV cache.
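The arithmetic above can be sketched as a small function. The `KvArch` shape and the name `kvBytesPerTokenSketch` are illustrative assumptions, not the library's actual types or API.

```typescript
// Hypothetical shape for illustration; the library's real type may differ.
interface KvArch {
  layers: number;
  kvHeads: number;
  headDim: number;
}

// Bytes per element for each KV precision (bf16 = 2, fp8/int8 = 1, int4 = 0.5).
const KV_PRECISION_BYTES = { bf16: 2, fp8: 1, int8: 1, int4: 0.5 } as const;
type KvPrecisionName = keyof typeof KV_PRECISION_BYTES;

// 2 (Key + Value) x layers x kv_heads x head_dim x bytes per element.
function kvBytesPerTokenSketch(arch: KvArch, precision: KvPrecisionName): number {
  return 2 * arch.layers * arch.kvHeads * arch.headDim * KV_PRECISION_BYTES[precision];
}

const llama70b: KvArch = { layers: 80, kvHeads: 8, headDim: 128 };
const perTokenBf16 = kvBytesPerTokenSketch(llama70b, "bf16");
// Full 128K context (131,072 tokens), expressed in binary gigabytes.
const fullContextGiB = (perTokenBf16 * 131072) / 1024 ** 3;
console.log(perTokenBf16, fullContextGiB);
```

Note that the 320 KB and 40 GB figures only line up when K and G are read as binary units (1,024 and 1,024³).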
Attention Types
How a model organizes its attention heads directly determines KV cache size:
| Type | Full Name | Description | KV Heads | KV Size Impact |
|---|---|---|---|---|
| MHA | Multi-Head Attention | Every query head has its own dedicated KV head. Original transformer design. | = query heads | Baseline (largest) |
| GQA | Grouped Query Attention | Multiple query heads share one KV head. Groups of query heads attend to the same K/V vectors. | fewer than query heads | Reduced by group factor (typically 4-8×) |
| MQA | Multi-Query Attention | All query heads share a single KV head. Extreme compression. | 1 | Minimal (smallest possible) |
Layers
The number of transformer blocks stacked sequentially in the model. Each layer has its own independent attention weights and stores its own KV cache. More layers = deeper model = more KV memory per token.
| Model | Layers | Impact |
|---|---|---|
| Llama 3.1 8B | 32 | Lightweight |
| Llama 3.1 70B | 80 | KV cache scales 2.5× vs 8B |
| Llama 3.1 405B | 126 | Nearly 4× the 8B’s KV per token |
Head Dimension
The size of each attention vector (Q, K, and V alike). Usually determined by hidden_dim / num_attention_heads. Larger head dimensions store more information per attention head but increase KV cache proportionally.
Most modern models use 128 (Llama, Mistral, Mixtral). Falcon uses 64. Gemma 2 9B uses 256.
Hidden Dimension
The width of the model’s internal representation: the size of the vector that represents each token as it flows through the network. Determines the model’s capacity to represent complex patterns. Related to head_dim via hidden_dim = attention_heads × head_dim.
FFN Dimension
The intermediate size of the feed-forward network inside each transformer layer. Typically 3-4× the hidden dimension. Affects prefill compute cost because FFN operations scale linearly with the prompt length N in each layer.
HBM (High Bandwidth Memory)
The GPU’s dedicated memory (often called VRAM). This is where model weights, KV cache, and activations must reside for active inference. HBM is fast (~2-3.35 TB/s on modern GPUs) but limited in capacity (24-80 GB per GPU). The entire capacity planning problem reduces to: what fits in HBM?
HBM Slots
The number of concurrent sessions that can have their full KV cache resident in GPU HBM. These sessions can generate tokens at full speed with no restore penalty. When HBM is full, new sessions must either wait or be served from NAND (with restore latency).
Weight Memory
The GPU memory consumed by the model’s parameters (weights). This is a fixed cost that must be paid regardless of how many users are served.
| Precision | Llama 70B Weight Size | Notes |
|---|---|---|
| bf16 | 140 GB | Full quality, needs 2+ H100s |
| int8 | 70 GB | Fits on 1× H100 |
| int4 (AWQ/GPTQ) | 35 GB | Exceeds one 22.5 GB RTX A5000; fits on 2× RTX A5000 with KV headroom |
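The table values follow from parameter count times bytes per parameter. A minimal sketch, assuming a hypothetical helper `weightMemoryGbSketch` (not the library's `weightMemory`) and ignoring the small extra cost of quantization scales and zero points:

```typescript
// Bytes per parameter for each weight precision (int4 packs two weights per byte).
const WEIGHT_BYTES_PER_PARAM = { bf16: 2, int8: 1, int4: 0.5 } as const;
type WeightPrecisionName = keyof typeof WEIGHT_BYTES_PER_PARAM;

// Billions of parameters x bytes per parameter gives decimal gigabytes directly.
function weightMemoryGbSketch(paramsBillions: number, precision: WeightPrecisionName): number {
  return paramsBillions * WEIGHT_BYTES_PER_PARAM[precision];
}

for (const precision of ["bf16", "int8", "int4"] as const) {
  console.log(precision, weightMemoryGbSketch(70, precision), "GB");
}
```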
NAND SSD Offloading
Using NVMe solid-state drives attached to each GPU server to store KV cache for inactive (parked) sessions. When a parked session becomes active, its KV cache is loaded from NAND back into HBM. NAND expands the total number of sessions the system can manage but does not help active inference speed; decoding still requires KV data in HBM.
| Storage Tier | Bandwidth | Latency | Role |
|---|---|---|---|
| GPU HBM | 2,000–3,350 GB/s | ~100ns | Active decoding |
| NVMe Gen4 SSD | ~7 GB/s | ~100µs | Cold session parking |
| NVMe Gen5 SSD | ~14 GB/s | ~100µs | Faster cold parking |
NAND Slots
The number of sessions that can be parked on NAND SSD while inactive. Computed as total_nand_gb / kv_per_session_gb. These sessions can be restored to HBM when they become active, at the cost of restore latency.
Restore Latency
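The time required to load a parked session’s KV cache from NAND SSD back into GPU HBM. This is the “wake-up cost” for a cold session.
restore_latency = kv_cache_size / effective_bw
When several restores run at once they share the drive’s bandwidth: effective_bw = nand_bw / parallel_streams.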
Cold Ratio
The percentage of total sessions that are parked on NAND at any given moment (inactive, not generating tokens). Typical values:
- 20-30%: most sessions are active (interactive chat)
- 50%: half parked (async agent workloads with tool waits)
- 70-80%: most parked (background research agents)
concurrent_active = total_sessions × (1 - cold_ratio)
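As a small illustration, the split between active and parked sessions might be computed as follows. The function name and the choice to round to whole sessions are assumptions made for this sketch.

```typescript
// Splits a session population into active (HBM) and parked (NAND) counts.
// Rounding to the nearest whole session is an assumption of this sketch.
function splitSessionsSketch(totalSessions: number, coldRatio: number) {
  if (coldRatio < 0 || coldRatio > 1) {
    throw new RangeError("coldRatio must be between 0 and 1");
  }
  const concurrentActive = Math.round(totalSessions * (1 - coldRatio));
  return { concurrentActive, parked: totalSessions - concurrentActive };
}

// Background research agents: 70% of 1,000 sessions are parked.
console.log(splitSessionsSketch(1000, 0.7));
```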
TPOT (Time Per Output Token)
The latency for each decode step, that is, generating one output token. Decoding is memory-bandwidth-bound because each step must stream the entire KV cache for all active sequences through HBM.
TTFT (Time To First Token)
The latency from when a user submits their prompt to when the first output token arrives. TTFT is dominated by prefill: processing the entire input prompt through every layer to build the KV cache. Prefill is compute-bound (not memory-bound like decode) because attention scales quadratically with prompt length. Under concurrent load, prefills are serialized on the GPU compute path. With C concurrent users, a random user finishes after (C+1)/2 prefills on average, counting their own:
ttft = single_prefill × (C + 1) / 2
TTFT Breach Point
The maximum number of concurrent users before average TTFT exceeds the configured SLA threshold. Computed by solving single_prefill × (C + 1) / 2 = sla for C:
C = 2 × sla / single_prefill - 1
Single Prefill Time
The time to process one prompt through all layers with no queue contention. This is the atomic unit that TTFT is built from.
Prefix Caching / Prefix Hit Rate
When multiple requests share the same prefix (system prompt, RAG context, few-shot examples), the KV cache for that prefix can be computed once and reused. A prefix cache hit skips the expensive prefill entirely for the shared portion. A 60% hit rate can effectively double throughput, making it the biggest “free” optimization in production inference.
Tensor Parallelism
Splitting a model across multiple GPUs within the same node. Each GPU holds a shard of the weights and a shard of each KV cache. GPUs communicate via NVLink during each forward pass.
- Increases total HBM (more GPUs = more memory)
- Increases aggregate bandwidth (faster TPOT)
- Increases aggregate TFLOPS (faster prefill, lower TTFT)
- Adds ~5-15% communication overhead via NVLink
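A rough sketch of how the first three effects add up for a tensor-parallel group. The `GpuSpecSketch` shape and function name are hypothetical, and the 5-15% communication overhead is deliberately not modeled here.

```typescript
// Hypothetical per-GPU spec, using the units of the GPU Specs table.
interface GpuSpecSketch {
  hbmGb: number;
  bandwidthTBs: number;
  bf16Tflops: number;
}

// Ideal (overhead-free) aggregates for `count` GPUs in one tensor-parallel group.
function aggregateTensorParallel(gpu: GpuSpecSketch, count: number) {
  return {
    totalHbmGb: gpu.hbmGb * count,
    aggregateBandwidthTBs: gpu.bandwidthTBs * count,
    aggregateTflops: gpu.bf16Tflops * count,
  };
}

const rtxA5000: GpuSpecSketch = { hbmGb: 22.5, bandwidthTBs: 0.768, bf16Tflops: 65 };
console.log(aggregateTensorParallel(rtxA5000, 8));
```

With 8× RTX A5000 this gives 180 GB of HBM, about 6.14 TB/s of bandwidth, and 520 peak TFLOPS before any communication overhead.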
Workload Mix
The distribution of session types by token intensity:
| Category | Token Range | Typical Use Case | KV Cache (70B, fp8) |
|---|---|---|---|
| Light | 0 – 50K | Quick Q&A, lookups | up to 7.8 GB |
| Medium | 50K – 200K | Multi-turn explanations | 7.8 – 31.2 GB |
| Heavy | 200K – 500K | Deep research, SWE tasks | 31.2 – 78.1 GB |
| Extreme | 500K+ | Full repo analysis, long agents | 78.1+ GB |
Example: a mix of { extreme: 1, heavy: 2, medium: 3, light: 4 } (10 users) produces a weighted average context of ~197K tokens.
Session Category Thresholds
The token boundaries used by the SessionProfiler to classify live sessions:
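A minimal sketch of such a classifier, assuming the boundaries match the Workload Mix token ranges (50K, 200K, 500K), that "K" means 1,024 tokens, and that each lower bound is inclusive. The function name is hypothetical and this is not the SessionProfiler's actual code.

```typescript
type SessionCategory = "light" | "medium" | "heavy" | "extreme";

// Assumption: "50K" in the Workload Mix table means 50 x 1,024 tokens.
const TOKENS_PER_K = 1024;

// Classifies a session by its current token count; lower bounds are inclusive.
function classifySessionSketch(tokens: number): SessionCategory {
  if (tokens < 50 * TOKENS_PER_K) return "light";
  if (tokens < 200 * TOKENS_PER_K) return "medium";
  if (tokens < 500 * TOKENS_PER_K) return "heavy";
  return "extreme";
}

console.log(classifySessionSketch(16384));  // a 16K-token chat session
console.log(classifySessionSketch(600000)); // a long-running agent
```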
Overhead
A fixed 5 GB budget for activations, CUDA contexts, framework metadata, and vLLM’s internal data structures (page tables, scheduling state). This is subtracted from total HBM before computing KV capacity.
Precision Options
KV cache and model weights can be quantized independently:
KV Precision
| Precision | Bytes/element | KV/token (70B) | Quality Impact |
|---|---|---|---|
| bf16 | 2 | 320 KB | Lossless (baseline) |
| fp8 | 1 | 160 KB | ~0.1-0.3% perplexity increase; standard production choice |
| int8 | 1 | 160 KB | Slightly more lossy than fp8 |
| int4 | 0.5 | 80 KB | Noticeable degradation on long contexts |
Weight Precision
| Precision | 70B Size | Min GPUs (H100) | Quality Impact |
|---|---|---|---|
| bf16 | 140 GB | 2× H100 | Lossless |
| int8 | 70 GB | 1× H100 | Minor degradation |
| int4 (AWQ/GPTQ) | 35 GB | 1× H100 | Acceptable for most tasks |
Model Architectures
15 models included out of the box, with specs sourced from Hugging Face config.json:
| Model | Type | Layers | KV Heads | Head Dim | KV/token (bf16) | Max Context |
|---|---|---|---|---|---|---|
| Llama 3.1 8B | GQA | 32 | 8 | 128 | 128 KB | 128K |
| Llama 3.1 70B | GQA | 80 | 8 | 128 | 320 KB | 128K |
| Llama 3.1 405B | GQA | 126 | 8 | 128 | 504 KB | 128K |
| Llama 2 7B | MHA | 32 | 32 | 128 | 512 KB | 4K |
| Llama 2 13B | MHA | 40 | 40 | 128 | 800 KB | 4K |
| Llama 2 70B | GQA | 80 | 8 | 128 | 320 KB | 4K |
| Mixtral 8×7B | GQA | 32 | 8 | 128 | 128 KB | 32K |
| Mixtral 8×22B | GQA | 56 | 8 | 128 | 224 KB | 64K |
| Falcon 7B | MQA | 32 | 1 | 64 | 8 KB | 8K |
| Falcon 40B | GQA | 60 | 8 | 64 | 120 KB | 8K |
| Mistral 7B | GQA | 32 | 8 | 128 | 128 KB | 32K |
| Phi-3 Mini | MHA | 32 | 32 | 96 | 384 KB | 128K |
| Gemma 2 9B | GQA | 42 | 8 | 256 | 336 KB | 8K |
| Gemma 2 27B | GQA | 46 | 16 | 128 | 368 KB | 8K |
GPU Specs
| GPU | HBM | Bandwidth | bf16 TFLOPS | NVLink | Use Case |
|---|---|---|---|---|---|
| H100 SXM | 80 GB | 3.35 TB/s | 989 | 900 GB/s | Premium cloud inference |
| A100 SXM | 80 GB | 2.0 TB/s | 312 | 600 GB/s | Standard cloud inference |
| L40S | 48 GB | 0.864 TB/s | 366 | None | Mid-tier / batch workloads |
| RTX A5000 | 22.5 GB | 0.768 TB/s | 65 | None | Cost-sensitive self-hosted |
| RTX 4090 | 24 GB | 1.008 TB/s | 330 | None | Dev / small-scale serving |
- HBM — Total GPU memory. Determines how much fits (weights + KV + overhead).
- Bandwidth — How fast data streams from HBM. Determines TPOT (decode speed).
- bf16 TFLOPS — Peak compute throughput. Determines prefill speed and TTFT.
- NVLink — GPU-to-GPU interconnect bandwidth. Only matters for tensor parallelism across multiple GPUs in the same node. GPUs without NVLink communicate over PCIe (~64 GB/s), which adds latency for multi-GPU setups.
How the Math Works — Step by Step
This section walks through every calculation the capacity planner performs, with a worked example using Llama 3.1 70B on 8× RTX A5000 with int4 AWQ weights and fp8 KV cache.
Step 1: KV Bytes Per Token
What: How many bytes does one token cost in the KV cache?
Formula: kv_bytes_per_token = 2 × layers × kv_heads × head_dim × precision_bytes
- 2: one Key vector + one Value vector per layer
- layers (80): each of the 80 transformer blocks stores its own K and V
- kv_heads (8): GQA means only 8 KV heads (not all 64 query heads)
- head_dim (128): each head stores a 128-dimensional vector
- precision_bytes (1 for fp8): bytes per floating-point element
Calculation: 2 × 80 × 8 × 128 × 1 byte = 163,840 bytes = 160 KB/token. In bf16 the same model needs 2 × 80 × 8 × 128 × 2 bytes = 327,680 bytes = 320 KB/token, double.
Code: kvBytesPerToken(arch, "fp8") in kv-estimator.ts
Step 2: KV Cache Per Session
What: Total KV memory for one session at a given average context length.
Formula: kv_per_session = context_tokens × kv_bytes_per_token
Calculation: 16,384 × 160 KB = 2.5 GB
Code: kvCacheForContext(arch, 16384, "fp8") in kv-estimator.ts
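A minimal sketch of this step, using a hypothetical helper rather than the library's `kvCacheForContext`. It reports the result in binary gigabytes, which is the convention under which 16,384 tokens × 160 KB comes out to exactly 2.5 GB.

```typescript
// Per-session KV cache in binary gigabytes (GiB): tokens x bytes per token / 1024^3.
function kvCacheGiBSketch(contextTokens: number, bytesPerToken: number): number {
  return (contextTokens * bytesPerToken) / 1024 ** 3;
}

// Llama 3.1 70B with fp8 KV: 163,840 bytes per token at a 16K average context.
const kvPerSessionGiB = kvCacheGiBSketch(16384, 163840);
console.log(kvPerSessionGiB);
```

In decimal units the same session is about 2.68 GB, which is the figure the TPOT row of the worked example uses.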
Step 3: Weight Memory
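What: GPU memory consumed by the model’s parameters.
Formula: weight_memory = bf16_weight_size × precision_scale (1 for bf16, 0.5 for int8, 0.25 for int4)
Calculation: 140 GB × 0.25 = 35 GB
Code: weightMemory(arch, "int4") in kv-estimator.ts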
Step 4: Free HBM for KV Cache
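What: How much GPU memory is available for KV cache after weights and overhead.
Formula: free_hbm = total_hbm - weight_memory - overhead
Calculation: 22.5 GB × 8 = 180 GB total; 180 - 35 - 5 = 140 GB
Code: capacity-planner.ts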
Step 5: HBM Slots (Active Sessions)
What: How many concurrent sessions fit in free HBM.
Formula: hbm_slots = floor(free_hbm / kv_per_session)
Calculation: floor(140 / 2.5) = 56
Code: maxConcurrentSessions() in capacity-planner.ts
Step 6: NAND Slots (Parked Sessions)
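What: How many additional sessions can be parked on SSD.
Formula: nand_slots = floor(total_nand_gb / kv_per_session_gb)
Calculation (4 TB per GPU): floor(32,000 / 2.5) = 12,800
Code: capacity-planner.ts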
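Steps 4 through 6 can be combined into one small function. This is a sketch under stated assumptions: the option names and `memoryPlanSketch` are hypothetical, the 5 GB overhead comes from the Overhead glossary entry, and free HBM is clamped at zero when the weights do not fit.

```typescript
// Fixed overhead budget from the Overhead glossary entry.
const OVERHEAD_GB = 5;

interface MemoryPlanInput {
  gpuCount: number;
  hbmPerGpuGb: number;
  weightGb: number;
  kvPerSessionGb: number;
  nandPerGpuGb: number;
}

// Free HBM, active (HBM) slots, and parked (NAND) slots for one configuration.
function memoryPlanSketch(input: MemoryPlanInput) {
  const totalHbmGb = input.gpuCount * input.hbmPerGpuGb;
  const freeHbmForKvGb = Math.max(0, totalHbmGb - input.weightGb - OVERHEAD_GB);
  const hbmSlots = Math.floor(freeHbmForKvGb / input.kvPerSessionGb);
  const nandSlots = Math.floor((input.gpuCount * input.nandPerGpuGb) / input.kvPerSessionGb);
  return { totalHbmGb, freeHbmForKvGb, hbmSlots, nandSlots };
}

// Worked example: Llama 3.1 70B int4 on 8x RTX A5000, 4 TB of NAND per GPU.
const memoryPlan = memoryPlanSketch({
  gpuCount: 8,
  hbmPerGpuGb: 22.5,
  weightGb: 35,
  kvPerSessionGb: 2.5,
  nandPerGpuGb: 4000,
});
console.log(memoryPlan);
```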
Step 7: TPOT (Decode Latency)
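What: How long each output token takes to generate.
Why bandwidth-bound: Each decode step must read the entire KV cache for all active sequences from HBM. The GPU compute units sit idle waiting for memory.
Formula: tpot = kv_bytes_read / aggregate_hbm_bandwidth
Calculation (1 user): 2.68 GB / (8 × 0.768 TB/s = 6.14 TB/s) = 0.44 ms
Code: estimateTpot() in latency-estimator.ts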
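A minimal sketch of the bandwidth-bound estimate. It is a hypothetical helper, not the library's `estimateTpot`, and like the worked example it counts only KV cache bytes read per step.

```typescript
// Decode latency per token in ms: bytes streamed per step / aggregate HBM bandwidth.
function tpotMsSketch(kvBytesRead: number, gpuCount: number, bandwidthTBsPerGpu: number): number {
  const bytesPerSecond = gpuCount * bandwidthTBsPerGpu * 1e12;
  return (kvBytesRead / bytesPerSecond) * 1000;
}

// One active 16K-token session (16,384 x 163,840 bytes) on 8x RTX A5000.
const tpotOneUserMs = tpotMsSketch(16384 * 163840, 8, 0.768);
console.log(tpotOneUserMs.toFixed(2), "ms");
```

Because the bytes read grow with every active sequence, passing the summed KV size of all active sessions shows how TPOT rises with concurrency.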
Step 8: Single Prefill Time
What: Time to process one prompt through all layers (no queue).
Why compute-bound: Prefill runs the full attention computation (quadratic in prompt length) plus FFN (linear). The GPU compute units are saturated, not the memory bus.
Formula: single_prefill = prefill_flops / sustained_flops_per_second
Calculation (4K prompt): 44T FLOPs / 182T FLOP/s = 241.9 ms
Code: singlePrefillMs() in latency-estimator.ts
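A sketch of the compute-bound estimate, taking the prompt's FLOP count as an input rather than deriving it. The 35% utilization factor is an assumption inferred from the worked example (182T sustained out of 8 × 65T peak); the library's actual factor and FLOP model may differ.

```typescript
// Prefill time in ms: total FLOPs / (peak FLOP/s across GPUs x utilization).
function singlePrefillMsSketch(
  prefillFlops: number,
  gpuCount: number,
  peakTflopsPerGpu: number,
  utilization: number, // assumed fraction of peak actually sustained
): number {
  const sustainedFlopsPerSecond = gpuCount * peakTflopsPerGpu * 1e12 * utilization;
  return (prefillFlops / sustainedFlopsPerSecond) * 1000;
}

// Roughly 44T FLOPs for a 4K prompt on 8x RTX A5000 at an assumed 35% utilization.
const prefillMs = singlePrefillMsSketch(44e12, 8, 65, 0.35);
console.log(prefillMs.toFixed(1), "ms");
```

With the rounded 44T input this lands within a fraction of a millisecond of the 241.9 ms used in the worked example.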
Step 9: TTFT Under Load
What: How long a user waits for the first token when other users are also submitting prompts.
Why it degrades: Prefills are serialized on the GPU compute path. With C concurrent users, each user’s prefill waits behind the others in a queue.
Formula: ttft = single_prefill × (C + 1) / 2
(C+1)/2 is the average queue position: if C users arrive simultaneously, a random user is at position 1 to C uniformly, so the average wait is (C+1)/2 prefills.
Calculation (10 concurrent users, 4K prompt): 241.9 ms × (10 + 1) / 2 = 241.9 × 5.5 ≈ 1,330 ms
Code: estimateTtft() in latency-estimator.ts
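The queueing model is simple enough to sketch directly. The helper below is hypothetical (not the library's `estimateTtft`) and assumes all C users arrive at the same moment, as the text describes.

```typescript
// Average TTFT in ms when C prefills are serialized: a random user sits at
// queue position (C + 1) / 2 on average, counting their own prefill.
function ttftMsSketch(singlePrefillMs: number, concurrentUsers: number): number {
  return (singlePrefillMs * (concurrentUsers + 1)) / 2;
}

for (const users of [1, 10, 40]) {
  console.log(users, "users:", ttftMsSketch(241.9, users).toFixed(0), "ms");
}
```

With a single user the expression reduces to the single prefill time, which is a quick sanity check on the formula.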
Step 10: TTFT Breach Point
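What: Maximum concurrent users before average TTFT exceeds the SLA.
Formula (solving Step 9 for C): C = 2 × sla / single_prefill - 1
Calculation (5 s SLA): 2 × 5000 / 241.9 - 1 ≈ 40.3, so the SLA holds through 40 concurrent users and is first exceeded at 41.
Code: ttftBreachPoint() in latency-estimator.ts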
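A sketch of the inversion, using a hypothetical helper rather than the library's `ttftBreachPoint`. Rounding down to the last user count that still meets the SLA is an assumption of this sketch; the library may report the first breaching count instead.

```typescript
// Largest C for which singlePrefillMs x (C + 1) / 2 stays at or below the SLA.
function maxUsersWithinSlaSketch(slaMs: number, singlePrefillMs: number): number {
  const exact = (2 * slaMs) / singlePrefillMs - 1;
  return Math.max(0, Math.floor(exact));
}

const maxUsers = maxUsersWithinSlaSketch(5000, 241.9);
// Check both sides of the boundary with the Step 9 formula.
const ttftAtMax = (241.9 * (maxUsers + 1)) / 2;
const ttftPastMax = (241.9 * (maxUsers + 2)) / 2;
console.log(maxUsers, ttftAtMax.toFixed(0), ttftPastMax.toFixed(0));
```

For the worked example this yields 40 users within the SLA, with the 41st user pushing average TTFT past 5 seconds.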
Step 11: Restore Latency
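What: Time to wake up a cold session from NAND SSD.
Formula: restore_latency = kv_to_load / effective_bw
Calculation (Gen4 NVMe): 0.3125 GB / 7 GB/s = 44.6 ms (0.3125 GB = 2.5 GB / 8 GPUs)
Code: restoreLatency() in latency-estimator.ts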
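A minimal sketch of the restore estimate, including the bandwidth sharing described in the glossary (effective_bw = nand_bw / parallel_streams). The function name and parameter names are hypothetical.

```typescript
// Restore time in ms: data to load / bandwidth available to this restore.
function restoreLatencyMsSketch(kvGbToLoad: number, nandBwGBs: number, parallelStreams = 1): number {
  if (parallelStreams < 1) throw new RangeError("parallelStreams must be at least 1");
  const effectiveBwGBs = nandBwGBs / parallelStreams;
  return (kvGbToLoad / effectiveBwGBs) * 1000;
}

console.log(restoreLatencyMsSketch(0.3125, 7).toFixed(1));     // Gen4, one stream
console.log(restoreLatencyMsSketch(0.3125, 14).toFixed(1));    // Gen5, one stream
console.log(restoreLatencyMsSketch(0.3125, 7, 2).toFixed(1));  // Gen4, two restores at once
```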
Step 12: Monthly GPU Cost
What: Infrastructure cost estimate.
Formula: monthly_cost = gpu_count × hourly_rate × 730 hours
Calculation: 8 × $1.10 × 730 = $6,424
Code: monthlyGpuCost() in infra-cost.ts
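As a sketch, the cost estimate is a single multiplication. The helper name is hypothetical, and the $1.10 hourly rate is the one used in the worked example rather than a quoted market price.

```typescript
// Average hours in a month (8,760 hours per year / 12).
const HOURS_PER_MONTH = 730;

// Monthly cost in USD for GPUs that run around the clock.
function monthlyGpuCostSketch(gpuCount: number, hourlyRateUsd: number): number {
  return gpuCount * hourlyRateUsd * HOURS_PER_MONTH;
}

const monthlyCost = monthlyGpuCostSketch(8, 1.1);
console.log(`$${monthlyCost.toFixed(0)}`);
```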
Step 13: Weighted Average Context (Workload Mix)
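What: Converts the session type distribution into a single average context length.
Formula: weighted_avg_context = Σ (sessions_in_category × representative_tokens_for_category) / total_sessions. The result is the average context length used by planCapacity().
Code: weightedAvgContext() in capacity-planner.ts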
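A sketch of the weighted average. The representative token counts per category are an assumption: the values below (25K, 125K, 350K, 750K, with K = 1,024) were chosen because they reproduce the ~197K figure quoted for the example mix, and the library's actual constants may differ.

```typescript
const TOKENS_K = 1024;

// Assumed representative context per category (not confirmed by the source).
const REPRESENTATIVE_TOKENS = {
  light: 25 * TOKENS_K,
  medium: 125 * TOKENS_K,
  heavy: 350 * TOKENS_K,
  extreme: 750 * TOKENS_K,
} as const;

type WorkloadMix = Record<keyof typeof REPRESENTATIVE_TOKENS, number>;

// Session-count-weighted mean of the representative context lengths.
function weightedAvgContextSketch(mix: WorkloadMix): number {
  let sessions = 0;
  let tokens = 0;
  for (const category of Object.keys(REPRESENTATIVE_TOKENS) as (keyof WorkloadMix)[]) {
    sessions += mix[category];
    tokens += mix[category] * REPRESENTATIVE_TOKENS[category];
  }
  return sessions === 0 ? 0 : tokens / sessions;
}

const avgContext = weightedAvgContextSketch({ extreme: 1, heavy: 2, medium: 3, light: 4 });
console.log(avgContext);
```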
Full Worked Example Summary
Config: Llama 3.1 70B, 8× RTX A5000, int4 weights, fp8 KV, 16K avg context
| Step | Calculation | Result |
|---|---|---|
| KV/token | 2 × 80 × 8 × 128 × 1 | 160 KB |
| KV/session (16K) | 16,384 × 160 KB | 2.5 GB |
| Weights (int4) | 140 × 0.25 | 35 GB |
| Total HBM | 22.5 × 8 | 180 GB |
| Free for KV | 180 - 35 - 5 | 140 GB |
| HBM slots | floor(140 / 2.5) | 56 |
| NAND slots (4TB/GPU) | floor(32,000 / 2.5) | 12,800 |
| TPOT (1 user) | 2.68 GB / 6.14 TB/s | 0.44 ms |
| Single prefill (4K) | 44T flops / 182T flops/s | 241.9 ms |
| TTFT (10 users) | 241.9 × 5.5 | 1,330 ms |
| TTFT breach (5s SLA) | 2×5000/241.9 - 1 | 41 users |
| Restore (Gen4 NVMe) | 0.3125 GB / 7 GB/s | 44.6 ms |
| Monthly cost | 8 × $1.10 × 730 | $6,424 |
The CapacityPlan Object
The planCapacity() function returns a complete CapacityPlan with every metric:
| Field | Type | Description |
|---|---|---|
| model | ModelArchitecture | The model being planned for |
| hardware | HardwareConfig | GPU setup (type, count, NAND) |
| kvPrecision | KvPrecision | KV cache precision used |
| weightPrecision | WeightPrecision | Weight quantization used |
| totalHbmGb | number | Total GPU memory across all GPUs |
| weightMemoryGb | number | Memory consumed by model weights |
| freeHbmForKvGb | number | HBM available for KV cache after weights + overhead |
| kvBytesPerToken | number | Bytes per token in KV cache |
| hbmSlots | number | Concurrent sessions fitting in HBM |
| nandSlots | number | Sessions parkable on NAND SSD |
| totalSessions | number | hbmSlots + nandSlots |
| tpotMs | number | Estimated Time Per Output Token (ms) |
| ttftMs | number | Estimated Time To First Token (ms) |
| restoreLatencyMs | number \| null | NAND → HBM restore time (null if no NAND) |
| ttftBreachPoint | number | Max concurrent users before 5s TTFT SLA breach |
| monthlyGpuCostUsd | number | Estimated monthly GPU infrastructure cost |
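The field table maps naturally onto a TypeScript interface. The sketch below mirrors the listed fields; `ModelArchitecture` and `HardwareConfig` are placeholders because their fields are not listed here, and `summarizePlanSketch` is a hypothetical helper showing how the nullable restore latency might be handled.

```typescript
// Placeholder types: named in the table, but their fields are not listed here.
type ModelArchitecture = Record<string, unknown>;
type HardwareConfig = Record<string, unknown>;
type KvPrecision = "bf16" | "fp8" | "int8" | "int4";
type WeightPrecision = "bf16" | "int8" | "int4";

interface CapacityPlan {
  model: ModelArchitecture;
  hardware: HardwareConfig;
  kvPrecision: KvPrecision;
  weightPrecision: WeightPrecision;
  totalHbmGb: number;
  weightMemoryGb: number;
  freeHbmForKvGb: number;
  kvBytesPerToken: number;
  hbmSlots: number;
  nandSlots: number;
  totalSessions: number; // hbmSlots + nandSlots
  tpotMs: number;
  ttftMs: number;
  restoreLatencyMs: number | null; // null when no NAND is configured
  ttftBreachPoint: number;
  monthlyGpuCostUsd: number;
}

// Hypothetical helper: one-line summary that handles the nullable field.
function summarizePlanSketch(plan: CapacityPlan): string {
  const restore =
    plan.restoreLatencyMs === null ? "no NAND" : `${plan.restoreLatencyMs.toFixed(1)} ms restore`;
  return `${plan.totalSessions} sessions (${plan.hbmSlots} HBM + ${plan.nandSlots} NAND), ${restore}`;
}
```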
Interactive Dashboard
The apps/capacity-planner/ Next.js app provides a full interactive UI with:
- Model selector (all 15 architectures)
- GPU type, count, NAND per GPU sliders
- KV/weight precision selectors
- Workload controls (avg context, cold ratio, SLA thresholds)
- Per-GPU breakdown panel (shows free HBM + NAND per card)
- 6 interactive charts:
  - Users vs GPUs: session capacity scaling with GPU count
  - Users vs Context: how capacity drops as context grows
  - TPOT vs Users: decode latency at different context sizes
  - TTFT vs Users: prefill queue congestion with SLA breach markers
  - Restore Budget: NAND restore time at Gen4/Gen5 bandwidth
  - GPU vs NAND: total sessions across NAND sizes
Related
- KV Estimator: kvBytesPerToken, kvCacheForContext, maxContextForMemory, weightMemory
- Capacity Planner: planCapacity, maxConcurrentSessions, estimateGpuCount
- Latency Estimator: estimateTpot, estimateTtft, ttftBreachPoint, restoreLatency
- Session Profiler: Runtime monitoring with EventBus integration
- Edge GPU Monitoring: Real-time GPU memory, utilization, and temperature via nvidia-smi
- Observability Metrics: Prometheus counters for KV cache and session categories
- Examples: Live monitor, config comparison, KV sizing