Capacity Planning

Agentium includes a built-in capacity planning library for modeling LLM inference infrastructure. It answers questions like:

How many GPUs do I need for N concurrent users?
What happens to latency when I add NAND SSD offloading?
What’s the KV cache pressure for my workload mix?
Where is the TTFT SLA breach point?

The system has three tiers:

Tier	What	Where
Tier 1	Pure-math capacity library	`@agentium/core` — zero dependencies
Tier 2	Runtime session profiling	`@agentium/core` + `@agentium/observability`
Tier 3	Interactive dashboard app	`apps/capacity-planner/` (Next.js)

Quick Start

import {
  planCapacity,
  DEFAULT_ARCHITECTURES,
  DEFAULT_GPU_SPECS,
} from "@agentium/core";

const plan = planCapacity(
  DEFAULT_ARCHITECTURES["llama-3.1-70b"],
  {
    gpu: DEFAULT_GPU_SPECS["h100-sxm"],
    gpuCount: 4,
    nandPerGpuGb: 0,
    nandBandwidthGBs: 7,
  },
  { extreme: 1, heavy: 2, medium: 3, light: 4 },
  "fp8",   // KV precision
  "bf16",  // weight precision
);

console.log(plan.hbmSlots);        // concurrent sessions in HBM
console.log(plan.ttftBreachPoint);  // max users before 5s TTFT SLA breach
console.log(plan.monthlyGpuCostUsd);

Glossary

Every term used in the capacity planning system, explained in detail.

KV Cache (Key-Value Cache)

During autoregressive generation, each transformer layer computes Key and Value projections for every token. Without caching, generating token N would require recomputing K and V for all N-1 prior tokens — quadratic cost per step. The KV cache stores these projections so each decode step only reads them — reducing cost to linear per step. The KV cache is the single largest consumer of GPU memory during inference. For Llama 3.1 70B at 128K context, the KV cache alone is ~40 GB in bf16 — larger than many GPUs.

KV Bytes Per Token

The memory required to store one token’s KV cache entry across all layers:

KV bytes/token = 2 × layers × kv_heads × head_dim × precision_bytes

The 2× accounts for both the Key tensor and the Value tensor. Each layer has its own independent set of K and V vectors, and each KV head stores a vector of head_dim floating-point values.

Example (Llama 3.1 70B in bf16): 2 × 80 layers × 8 kv_heads × 128 head_dim × 2 bytes = 327,680 bytes (320 KB per token).

A single 128K-context session therefore consumes 131,072 × 320 KB = 40 GB of KV cache.
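The formula can be sketched as a small standalone function. This is an illustration only, not the library's kvBytesPerToken implementation; the names `KvShape`, `KV_ELEMENT_BYTES`, and `kvBytesPerTokenSketch` are hypothetical.

```typescript
// Hypothetical standalone sketch; not the @agentium/core implementation.
type KvPrecisionName = "bf16" | "fp8" | "int8" | "int4";

// Bytes per stored element for each KV precision.
const KV_ELEMENT_BYTES: Record<KvPrecisionName, number> = {
  bf16: 2,
  fp8: 1,
  int8: 1,
  int4: 0.5,
};

interface KvShape {
  layers: number;
  kvHeads: number;
  headDim: number;
}

// 2 (Key and Value) x layers x kv_heads x head_dim x bytes per element.
function kvBytesPerTokenSketch(shape: KvShape, precision: KvPrecisionName): number {
  return 2 * shape.layers * shape.kvHeads * shape.headDim * KV_ELEMENT_BYTES[precision];
}

const llama70bShape: KvShape = { layers: 80, kvHeads: 8, headDim: 128 };
console.log(kvBytesPerTokenSketch(llama70bShape, "bf16")); // 327680
console.log(kvBytesPerTokenSketch(llama70bShape, "fp8"));  // 163840
```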

Attention Types

How a model organizes its attention heads directly determines KV cache size:

Type	Full Name	Description	KV Heads	KV Size Impact
MHA	Multi-Head Attention	Every query head has its own dedicated KV head. Original transformer design.	= query heads	Baseline (largest)
GQA	Grouped Query Attention	Multiple query heads share one KV head. Groups of query heads attend to the same K/V vectors.	fewer than query heads	Reduced by group factor (typically 4-8×)
MQA	Multi-Query Attention	All query heads share a single KV head. Extreme compression.	1	Minimal (smallest possible)

Why it matters: Llama 3.1 70B uses GQA with 64 query heads but only 8 KV heads — an 8× reduction in KV cache compared to MHA. Falcon 7B uses MQA with just 1 KV head — KV cache is only 8 KB/token vs 320 KB for Llama 70B.
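To make the 8× figure concrete, the sketch below holds the Llama 3.1 70B shape fixed and varies only the KV head count. The MHA and MQA rows are hypothetical variants for comparison, and `kvBytesBf16` is an illustrative helper, not a library function.

```typescript
// Hypothetical helper: KV bytes per token in bf16 (2 bytes per element).
function kvBytesBf16(layerCount: number, kvHeadCount: number, headSize: number): number {
  return 2 * layerCount * kvHeadCount * headSize * 2;
}

// Llama 3.1 70B shape: 80 layers, 64 query heads, head_dim 128.
const mhaBytes = kvBytesBf16(80, 64, 128); // hypothetical: one KV head per query head
const gqaBytes = kvBytesBf16(80, 8, 128);  // actual config: 8 shared KV heads
const mqaBytes = kvBytesBf16(80, 1, 128);  // hypothetical: a single shared KV head

// KB per token for each layout.
console.log(mhaBytes / 1024, gqaBytes / 1024, mqaBytes / 1024); // 2560 320 40
// Reduction factor of GQA relative to MHA.
console.log(mhaBytes / gqaBytes); // 8
```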

Layers

The number of transformer blocks stacked sequentially in the model. Each layer has its own independent attention weights and stores its own KV cache. More layers = deeper model = more KV memory per token.

Model	Layers	Impact
Llama 3.1 8B	32	Lightweight
Llama 3.1 70B	80	KV cache scales 2.5× vs 8B
Llama 3.1 405B	126	Nearly 4× the 8B’s KV per token

Head Dimension

The size of each attention vector (Q, K, and V alike). Usually hidden_dim / num_attention_heads, though some models set it independently. Larger head dimensions store more information per attention head but increase KV cache proportionally. Most modern models use 128 (Llama, Mistral, Mixtral). Falcon uses 64. Gemma 2 9B uses 256.

Hidden Dimension

The width of the model’s internal representation — the size of the vector that represents each token as it flows through the network. Determines the model’s capacity to represent complex patterns. Related to head_dim via hidden_dim = attention_heads × head_dim.

FFN Dimension

The intermediate size of the feed-forward network inside each transformer layer. Typically 3-4× the hidden dimension. Affects prefill compute cost because FFN operations scale linearly with the prompt length N in each layer.

HBM (High Bandwidth Memory)

The GPU’s on-chip memory (often called VRAM). This is where model weights, KV cache, and activations must reside for active inference. HBM is fast (~2-3.35 TB/s on modern GPUs) but limited in capacity (24-80 GB per GPU). The entire capacity planning problem reduces to: what fits in HBM?

total_hbm = gpu_hbm × gpu_count
free_for_kv = total_hbm - weight_memory - overhead (5 GB)
hbm_slots = floor(free_for_kv / kv_per_session)

HBM Slots

The number of concurrent sessions that can have their full KV cache resident in GPU HBM. These sessions can generate tokens at full speed with no restore penalty. When HBM is full, new sessions must either wait or be served from NAND (with restore latency).

Weight Memory

The GPU memory consumed by the model’s parameters (weights). This is a fixed cost that must be paid regardless of how many users are served.

Precision	Llama 70B Weight Size	Notes
bf16	140 GB	Full quality, needs 2+ H100s
int8	70 GB	Fits on 1× H100
int4 (AWQ/GPTQ)	35 GB	Fits on 1× H100, or 2× RTX A5000 with a few GB of KV headroom

NAND SSD Offloading

Using NVMe solid-state drives attached to each GPU server to store KV cache for inactive (parked) sessions. When a parked session becomes active, its KV cache is loaded from NAND back into HBM. NAND expands the total number of sessions the system can manage but does not help active inference speed — decoding still requires KV data in HBM.

Storage Tier	Bandwidth	Latency	Role
GPU HBM	2,000–3,350 GB/s	~100ns	Active decoding
NVMe Gen4 SSD	~7 GB/s	~100µs	Cold session parking
NVMe Gen5 SSD	~14 GB/s	~100µs	Faster cold parking

NAND Slots

The number of sessions that can be parked on NAND SSD while inactive. Computed as floor(total_nand_gb / kv_per_session_gb). These sessions can be restored to HBM when they become active, at the cost of restore latency.

Restore Latency

The time required to load a parked session’s KV cache from NAND SSD back into GPU HBM. This is the “wake-up cost” for a cold session.

restore_time = kv_size_gb / nand_bandwidth_gb_per_sec

Example: a 5 GB KV cache on Gen4 NVMe (7 GB/s) takes 5 / 7 ≈ 0.714 s, or 714 ms, to restore.

When multiple sessions restore simultaneously, they share the SSD bandwidth, which increases each session’s restore time: effective_bw = nand_bw / parallel_streams.

Cold Ratio

The percentage of total sessions that are parked on NAND at any given moment (inactive, not generating tokens). Typical values:

20-30% — most sessions are active (interactive chat)
50% — half parked (async agent workloads with tool waits)
70-80% — most parked (background research agents)

concurrent_active = total_sessions × (1 - cold_ratio)
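A minimal sketch of this split follows. The helper name `splitSessions` and the choice to round the parked count to a whole session are assumptions for illustration, not library behavior.

```typescript
// Hypothetical helper: split a session population into active and parked counts.
function splitSessions(
  totalSessions: number,
  coldRatio: number,
): { active: number; parked: number } {
  if (coldRatio < 0 || coldRatio > 1) {
    throw new RangeError("coldRatio must be between 0 and 1");
  }
  // Assumption: round to whole sessions; active is whatever is not parked.
  const parked = Math.round(totalSessions * coldRatio);
  return { active: totalSessions - parked, parked };
}

console.log(splitSessions(200, 0.25)); // { active: 150, parked: 50 }
console.log(splitSessions(200, 0.75)); // { active: 50, parked: 150 }
```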

TPOT (Time Per Output Token)

The latency for each decode step — generating one output token. Decoding is memory-bandwidth-bound because each step must stream the entire KV cache for all active sequences through HBM.

TPOT = (context_tokens × batch_size × kv_bytes_per_token) / aggregate_bandwidth

TPOT scales linearly with context length and batch size. A user perceives this as the streaming speed — lower TPOT = faster text output. Interactive applications target < 50ms TPOT (~20 tokens/sec streaming).

TTFT (Time To First Token)

The latency from when a user submits their prompt to when the first output token arrives. TTFT is dominated by prefill: processing the entire input prompt through every layer to build the KV cache. Prefill is compute-bound (not memory-bound like decode) because attention scales quadratically with prompt length.

Under concurrent load, prefills are serialized on the GPU compute path. With C concurrent users, a random user sits at queue position (C + 1) / 2 on average, counting their own prefill:

TTFT(C users) = single_prefill_time × (C + 1) / 2

Interactive applications target < 1-5 seconds TTFT.

TTFT Breach Point

The maximum number of concurrent users before average TTFT exceeds the configured SLA threshold. Computed by solving:

single_prefill × (C + 1) / 2 = ttft_sla
→ C = 2 × ttft_sla / single_prefill - 1

Adding more GPUs increases TFLOPS, which reduces single prefill time, which pushes the breach point out. Adding NAND does not move the breach point — NAND doesn’t help prefill compute.

Single Prefill Time

The time to process one prompt through all layers with no queue contention. This is the atomic unit that TTFT is built from.

prefill_flops = (4 × N² × hidden_dim + 4 × N × ffn_dim) × layers
single_prefill = prefill_flops / (gpu_tflops × gpu_count × efficiency)

Where efficiency is ~35% (real-world vs peak TFLOPS). The quadratic attention term dominates at long contexts — a 32K prompt takes ~64x longer than a 4K prompt, not 8x.

Prefix Caching / Prefix Hit Rate

When multiple requests share the same prefix (system prompt, RAG context, few-shot examples), the KV cache for that prefix can be computed once and reused. A prefix cache hit skips the expensive prefill entirely for the shared portion. A 60% hit rate can effectively double throughput — the biggest “free” optimization in production inference.

Tensor Parallelism

Splitting a model across multiple GPUs within the same node. Each GPU holds a shard of the weights and a shard of each KV cache. GPUs communicate via NVLink during each forward pass.

Increases total HBM (more GPUs = more memory)
Increases aggregate bandwidth (faster TPOT)
Increases aggregate TFLOPS (faster prefill, lower TTFT)
Adds ~5-15% communication overhead via NVLink

Workload Mix

The distribution of session types by token intensity:

Category	Token Range	Typical Use Case	KV Cache (70B, fp8)
Light	0 – 50K	Quick Q&A, lookups	up to 7.6 GB
Medium	50K – 200K	Multi-turn explanations	7.6 – 30.5 GB
Heavy	200K – 500K	Deep research, SWE tasks	30.5 – 76.3 GB
Extreme	500K+	Full repo analysis, long agents	76.3+ GB

The workload mix determines the weighted average context length and drives the capacity plan. A mix of { extreme: 1, heavy: 2, medium: 3, light: 4 } (10 users) produces a weighted average context of ~243K tokens (derived in Step 13).

Session Category Thresholds

The token boundaries used by the SessionProfiler to classify live sessions:

SESSION_CATEGORY_THRESHOLDS = {
  light:   50_000,     // up to 50K tokens
  medium:  200_000,    // 50K - 200K
  heavy:   500_000,    // 200K - 500K
  extreme: Infinity,   // 500K+
};
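A classifier over these thresholds might look like the sketch below. It is not the SessionProfiler's code: the name `classifySession` is hypothetical, and treating each upper bound as inclusive is an assumption based on the "up to 50K" comment.

```typescript
type SessionCategory = "light" | "medium" | "heavy" | "extreme";

// Upper bounds copied from SESSION_CATEGORY_THRESHOLDS, ordered smallest first.
const CATEGORY_UPPER_BOUNDS: Array<[SessionCategory, number]> = [
  ["light", 50_000],
  ["medium", 200_000],
  ["heavy", 500_000],
  ["extreme", Infinity],
];

// Assumption: a session exactly at a boundary belongs to the lower category.
function classifySession(tokens: number): SessionCategory {
  for (const [category, upperBound] of CATEGORY_UPPER_BOUNDS) {
    if (tokens <= upperBound) return category;
  }
  return "extreme"; // only reachable for NaN input
}

console.log(classifySession(12_000));  // light
console.log(classifySession(180_000)); // medium
console.log(classifySession(750_000)); // extreme
```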

Overhead

A fixed 5 GB budget for activations, CUDA contexts, framework metadata, and vLLM’s internal data structures (page tables, scheduling state). This is subtracted from total HBM before computing KV capacity.

Precision Options

KV cache and model weights can be quantized independently:

KV Precision

Precision	Bytes/element	KV/token (70B)	Quality Impact
`bf16`	2	320 KB	Lossless — baseline
`fp8`	1	160 KB	~0.1-0.3% perplexity increase — standard production choice
`int8`	1	160 KB	Slightly more lossy than fp8
`int4`	0.5	80 KB	Noticeable degradation on long contexts

fp8 KV is standard practice — it halves memory and bandwidth usage with negligible quality loss.

Weight Precision

Precision	70B Size	Min GPUs (H100)	Quality Impact
`bf16`	140 GB	2× H100	Lossless
`int8`	70 GB	1× H100	Minor degradation
`int4` (AWQ/GPTQ)	35 GB	1× H100 (or 2× RTX A5000)	Acceptable for most tasks

The standard production setup is fp8 KV + bf16 weights for cloud GPUs, or fp8 KV + int4 weights for cost-sensitive self-hosted deployments.

Model Architectures

15 models included out of the box, with specs sourced from HuggingFace config.json:

Model	Type	Layers	KV Heads	Head Dim	KV/token (bf16)	Max Context
Llama 3.1 8B	GQA	32	8	128	128 KB	128K
Llama 3.1 70B	GQA	80	8	128	320 KB	128K
Llama 3.1 405B	GQA	126	8	128	504 KB	128K
Llama 2 7B	MHA	32	32	128	512 KB	4K
Llama 2 13B	MHA	40	40	128	800 KB	4K
Llama 2 70B	GQA	80	8	128	320 KB	4K
Mixtral 8×7B	GQA	32	8	128	128 KB	32K
Mixtral 8×22B	GQA	56	8	128	224 KB	64K
Falcon 7B	MQA	32	1	64	8 KB	8K
Falcon 40B	GQA	60	8	64	120 KB	8K
Mistral 7B	GQA	32	8	128	128 KB	32K
Phi-3 Mini	MHA	32	32	96	384 KB	128K
Gemma 2 9B	GQA	42	8	256	336 KB	8K
Gemma 2 27B	GQA	46	16	128	368 KB	8K

Custom architectures can be passed to any function:

const myModel: ModelArchitecture = {
  id: "my-model",
  displayName: "My Custom 13B",
  family: "custom",
  params: "13B",
  layers: 40,
  attentionHeads: 40,
  kvHeads: 8,
  headDim: 128,
  hiddenDim: 5120,
  ffnDim: 13824,
  maxContext: 32768,
  attentionType: "gqa",
  weightSizeBf16Gb: 26,
};

GPU Specs

GPU	HBM	Bandwidth	bf16 TFLOPS	NVLink	Use Case
H100 SXM	80 GB	3.35 TB/s	989	900 GB/s	Premium cloud inference
A100 SXM	80 GB	2.0 TB/s	312	600 GB/s	Standard cloud inference
L40S	48 GB	0.864 TB/s	366	None	Mid-tier / batch workloads
RTX A5000	22.5 GB	0.768 TB/s	65	None	Cost-sensitive self-hosted
RTX 4090	24 GB	1.008 TB/s	330	None	Dev / small-scale serving

Key metrics explained:

HBM — Total GPU memory. Determines how much fits (weights + KV + overhead).
Bandwidth — How fast data streams from HBM. Determines TPOT (decode speed).
bf16 TFLOPS — Peak compute throughput. Determines prefill speed and TTFT.
NVLink — GPU-to-GPU interconnect bandwidth. Only matters for tensor parallelism across multiple GPUs in the same node. GPUs without NVLink communicate over PCIe (~64 GB/s), which adds latency for multi-GPU setups.

How the Math Works — Step by Step

This section walks through every calculation the capacity planner performs, with a worked example using Llama 3.1 70B on 8× RTX A5000 with int4 AWQ weights and fp8 KV cache.

Step 1: KV Bytes Per Token

What: How many bytes does one token cost in the KV cache?

Formula:

kv_bytes_per_token = 2 × layers × kv_heads × head_dim × precision_bytes

Why each term:

2 — one Key vector + one Value vector per layer
layers (80) — each of the 80 transformer blocks stores its own K and V
kv_heads (8) — GQA means only 8 KV heads (not all 64 query heads)
head_dim (128) — each head stores a 128-dimensional vector
precision_bytes (1 for fp8) — bytes per floating-point element

Calculation:

2 × 80 × 8 × 128 × 1 = 163,840 bytes = 160 KB/token

If we used bf16 instead of fp8, the last factor would be 2 bytes, giving 327,680 bytes = 320 KB/token, which is double.

Code: kvBytesPerToken(arch, "fp8") in kv-estimator.ts

Step 2: KV Cache Per Session

What: Total KV memory for one session at a given average context length.

Formula:

kv_per_session = avg_context_tokens × kv_bytes_per_token

Calculation (16K context):

16,384 tokens × 163,840 bytes = 2,684,354,560 bytes = 2.5 GB per session

Calculation (128K full context):

131,072 tokens × 163,840 bytes = 21,474,836,480 bytes = 20 GB per session

Code: kvCacheForContext(arch, 16384, "fp8") in kv-estimator.ts
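The two calculations above can be reproduced with a short sketch. The helper `kvPerSessionGb` is hypothetical, and it assumes, as the worked numbers do, that 1 GB means 2^30 bytes.

```typescript
// The worked example reports GB as 2^30 bytes.
const BYTES_PER_GB = 1024 ** 3;

// Hypothetical helper: KV cache size in GB for one session.
function kvPerSessionGb(contextTokens: number, kvBytesPerToken: number): number {
  return (contextTokens * kvBytesPerToken) / BYTES_PER_GB;
}

// Llama 3.1 70B with fp8 KV: 163,840 bytes per token.
console.log(kvPerSessionGb(16_384, 163_840));  // 2.5
console.log(kvPerSessionGb(131_072, 163_840)); // 20
```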

Step 3: Weight Memory

What: GPU memory consumed by the model’s parameters.

Formula:

weight_memory = weight_size_bf16 × precision_ratio

Precision ratios: bf16 = 1.0, int8 = 0.5, int4 = 0.25

Calculation (int4 AWQ):

140 GB × 0.25 = 35 GB

Without quantization (bf16), weights would be 140 GB, needing 2× H100s just for weights. With int4, they fit on a single 80 GB H100 with room to spare, or across two of the 22.5 GB RTX A5000s used in this example.

Code: weightMemory(arch, "int4") in kv-estimator.ts
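As a sketch of the ratio lookup, with hypothetical names (`WEIGHT_RATIO`, `weightMemoryGbSketch`) rather than the library's own:

```typescript
type WeightPrecisionName = "bf16" | "int8" | "int4";

// Size relative to the bf16 checkpoint, as listed in Step 3.
const WEIGHT_RATIO: Record<WeightPrecisionName, number> = {
  bf16: 1.0,
  int8: 0.5,
  int4: 0.25,
};

// Hypothetical helper: weight memory in GB at the chosen precision.
function weightMemoryGbSketch(weightSizeBf16Gb: number, precision: WeightPrecisionName): number {
  return weightSizeBf16Gb * WEIGHT_RATIO[precision];
}

console.log(weightMemoryGbSketch(140, "bf16")); // 140
console.log(weightMemoryGbSketch(140, "int8")); // 70
console.log(weightMemoryGbSketch(140, "int4")); // 35
```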

Step 4: Free HBM for KV Cache

What: How much GPU memory is available for KV cache after weights and overhead.

Formula:

total_hbm = gpu_hbm × gpu_count
free_hbm = total_hbm - weight_memory - overhead

Calculation (8× RTX A5000):

total_hbm = 22.5 GB × 8 = 180 GB
free_hbm  = 180 - 35 - 5 = 140 GB

The 5 GB overhead covers CUDA contexts, vLLM paging metadata, activation buffers, and framework state.

Code: Lines 92-94 in capacity-planner.ts

Step 5: HBM Slots (Active Sessions)

What: How many concurrent sessions fit in free HBM.

Formula:

hbm_slots = floor(free_hbm / kv_per_session)

Calculation (16K avg context, fp8):

hbm_slots = floor(140 GB / 2.5 GB) = 56 sessions

Calculation (4K avg context, fp8):

kv_per_session = 4,096 × 163,840 = 0.625 GB
hbm_slots = floor(140 / 0.625) = 224 sessions

Notice how context length dominates: 4× shorter context = 4× more sessions.

Code: maxConcurrentSessions() in capacity-planner.ts
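Steps 4 and 5 combine into a few lines. The sketch below uses hypothetical names (`HbmBudget`, `hbmSlotsSketch`) and returns 0 when weights and overhead already exceed total HBM, which is an assumption about how that edge case should be handled.

```typescript
interface HbmBudget {
  gpuHbmGb: number;       // memory per GPU
  gpuCount: number;
  weightMemoryGb: number; // from Step 3
  overheadGb: number;     // fixed 5 GB in this document
}

// Hypothetical helper: sessions whose full KV cache fits in free HBM.
function hbmSlotsSketch(budget: HbmBudget, kvPerSessionGb: number): number {
  const totalHbmGb = budget.gpuHbmGb * budget.gpuCount;
  const freeHbmGb = totalHbmGb - budget.weightMemoryGb - budget.overheadGb;
  if (freeHbmGb <= 0) return 0; // assumption: nothing fits if weights overflow HBM
  return Math.floor(freeHbmGb / kvPerSessionGb);
}

const a5000x8: HbmBudget = { gpuHbmGb: 22.5, gpuCount: 8, weightMemoryGb: 35, overheadGb: 5 };
console.log(hbmSlotsSketch(a5000x8, 2.5));   // 56  (16K context, fp8)
console.log(hbmSlotsSketch(a5000x8, 0.625)); // 224 (4K context, fp8)
```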

Step 6: NAND Slots (Parked Sessions)

What: How many additional sessions can be parked on SSD.

Formula:

total_nand = nand_per_gpu × gpu_count
nand_slots = floor(total_nand / kv_per_session)

Calculation (4 TB NAND per GPU, 16K context, fp8):

total_nand = 4,000 GB × 8 = 32,000 GB
nand_slots = floor(32,000 / 2.5) = 12,800 parked sessions
total_sessions = 56 (HBM) + 12,800 (NAND) = 12,856

NAND massively expands capacity. But those 12,800 sessions are parked and incur restore latency before they can become active.

Code: Lines 32-36 in capacity-planner.ts
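The same calculation as a sketch; `nandSlotsSketch` is a hypothetical name, and the HBM slot count of 56 is carried over from Step 5.

```typescript
// Hypothetical helper: sessions that fit on the combined NAND of all GPUs.
function nandSlotsSketch(nandPerGpuGb: number, gpuCount: number, kvPerSessionGb: number): number {
  const totalNandGb = nandPerGpuGb * gpuCount;
  return Math.floor(totalNandGb / kvPerSessionGb);
}

const parkedSlots = nandSlotsSketch(4_000, 8, 2.5);
const hbmSlotsFromStep5 = 56;

console.log(parkedSlots);                     // 12800
console.log(hbmSlotsFromStep5 + parkedSlots); // 12856
console.log(nandSlotsSketch(0, 8, 2.5));      // 0 (no NAND configured)
```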

Step 7: TPOT (Decode Latency)

What: How long each output token takes to generate.

Why bandwidth-bound: Each decode step must read the entire KV cache for all active sequences from HBM. The GPU compute is idle waiting for memory.

Formula:

total_bytes = context_tokens × batch_size × kv_bytes_per_token
bandwidth = gpu_bandwidth_TBs × gpu_count × 10^12  (convert to bytes/sec)
tpot_ms = (total_bytes / bandwidth) × 1000

Calculation (16K context, 1 user, 8× RTX A5000):

total_bytes = 16,384 × 1 × 163,840 = 2,684,354,560 bytes
bandwidth   = 0.768 × 8 × 10^12 = 6,144,000,000,000 bytes/sec
tpot_ms     = (2,684,354,560 / 6,144,000,000,000) × 1000 = 0.44 ms

With 10 concurrent users (batch=10):

total_bytes = 16,384 × 10 × 163,840 = 26,843,545,600 bytes
tpot_ms     = (26,843,545,600 / 6,144,000,000,000) × 1000 = 4.37 ms

TPOT scales linearly with batch size. With a 50 ms SLA at 16K context, this configuration stays within budget up to about 114 concurrent users.

Code: estimateTpot() in latency-estimator.ts
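The decode model above, including the largest batch that stays inside a TPOT budget, can be sketched as follows. Both function names are hypothetical, and the model counts only KV cache reads, as the formula in this step does.

```typescript
// Hypothetical helper: bandwidth-bound decode latency per output token, in ms.
function tpotMsSketch(
  contextTokens: number,
  batchSize: number,
  kvBytesPerToken: number,
  gpuBandwidthTBs: number,
  gpuCount: number,
): number {
  const totalBytes = contextTokens * batchSize * kvBytesPerToken;
  const bandwidthBytesPerSec = gpuBandwidthTBs * gpuCount * 1e12;
  return (totalBytes / bandwidthBytesPerSec) * 1000;
}

// Largest batch whose TPOT stays at or under the SLA (TPOT is linear in batch size).
function maxBatchWithinTpotSla(slaMs: number, singleUserTpotMs: number): number {
  return Math.floor(slaMs / singleUserTpotMs);
}

const tpotOneUser = tpotMsSketch(16_384, 1, 163_840, 0.768, 8);
console.log(tpotOneUser.toFixed(2));                                     // 0.44
console.log(tpotMsSketch(16_384, 10, 163_840, 0.768, 8).toFixed(2));     // 4.37
console.log(maxBatchWithinTpotSla(50, tpotOneUser));                     // 114
```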

Step 8: Single Prefill Time

What: Time to process one prompt through all layers (no queue).

Why compute-bound: Prefill runs the full attention computation (quadratic in prompt length) plus FFN (linear). The GPU compute units, not memory bandwidth, are the saturated resource.

Formula:

prefill_flops = (4 × N² × hidden_dim + 4 × N × ffn_dim) × layers
gpu_flops     = gpu_tflops × gpu_count × efficiency × 10^12
single_prefill_ms = (prefill_flops / gpu_flops) × 1000

The efficiency factor is 0.35 (35%): real-world GPU utilization vs peak spec. This accounts for memory stalls, kernel launch overhead, and tensor parallelism communication.

Calculation (4K prompt, 8× RTX A5000):

N = 4,096
prefill_flops = (4 × 4096² × 8192 + 4 × 4096 × 28672) × 80
              = (4 × 16,777,216 × 8192 + 4 × 4096 × 28672) × 80
              = (549,755,813,888 + 469,762,048) × 80
              = 550,225,575,936 × 80
              = 44,018,046,074,880 flops

gpu_flops     = 65 × 8 × 0.35 × 10^12 = 182,000,000,000,000 flops/sec

single_prefill = (44,018,046,074,880 / 182,000,000,000,000) × 1000 = 241.9 ms

Why a 32K prompt is ~64× slower than 4K (not 8×): The quadratic attention term dominates. When N grows 8×, the attention cost grows 64×. This is why long-context prefill is so expensive.

Code: singlePrefillMs() in latency-estimator.ts
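The prefill estimate and the 32K versus 4K ratio can be checked with the sketch below. It mirrors the formula in this step using hypothetical function names and the document's default efficiency of 0.35.

```typescript
// Hypothetical helper: FLOPs for one prefill pass, per the Step 8 formula.
function prefillFlopsSketch(n: number, hiddenDim: number, ffnDim: number, layerCount: number): number {
  return (4 * n * n * hiddenDim + 4 * n * ffnDim) * layerCount;
}

// Hypothetical helper: prefill time in ms at a given sustained efficiency.
function singlePrefillMsSketch(
  flops: number,
  gpuTflops: number,
  gpuCount: number,
  efficiency: number = 0.35,
): number {
  const flopsPerSec = gpuTflops * gpuCount * efficiency * 1e12;
  return (flops / flopsPerSec) * 1000;
}

// Llama 3.1 70B: hidden_dim 8192, ffn_dim 28672, 80 layers.
const flops4k = prefillFlopsSketch(4_096, 8_192, 28_672, 80);
const flops32k = prefillFlopsSketch(32_768, 8_192, 28_672, 80);

console.log(flops4k);                                       // 44018046074880
console.log(singlePrefillMsSketch(flops4k, 65, 8).toFixed(1)); // 241.9
console.log(flops32k / flops4k);                            // just under 64
```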

Step 9: TTFT Under Load

What: How long a user waits for the first token when other users are also submitting prompts.

Why it degrades: Prefills are serialized on the GPU compute path. With C concurrent users, each user’s prefill waits behind the others in a queue.

Formula:

TTFT(C users) = single_prefill × (C + 1) / 2

The (C+1)/2 is the average queue position: if C users arrive simultaneously, a random user is at position 1 to C uniformly, so the average wait is (C+1)/2 prefills.

Calculation (10 concurrent users, 4K prompt):

TTFT = 241.9 ms × (10 + 1) / 2 = 241.9 × 5.5 = 1,330 ms (1.3 seconds)

Calculation (100 concurrent users):

TTFT = 241.9 × (100 + 1) / 2 = 241.9 × 50.5 = 12,216 ms (12.2 seconds)

Code: estimateTtft() in latency-estimator.ts
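The closed form can be cross-checked against a direct average over queue positions. Both helpers below are hypothetical sketches of the serialized-prefill model described in this step, not the library's estimateTtft().

```typescript
// Closed form from Step 9: average TTFT with C users arriving together.
function ttftUnderLoadMs(singlePrefillMs: number, users: number): number {
  return (singlePrefillMs * (users + 1)) / 2;
}

// Direct average: the user at queue position p finishes after p prefills.
function meanCompletionMs(singlePrefillMs: number, users: number): number {
  let totalMs = 0;
  for (let position = 1; position <= users; position++) {
    totalMs += position * singlePrefillMs;
  }
  return totalMs / users;
}

console.log(ttftUnderLoadMs(241.9, 10).toFixed(0));  // 1330
console.log(ttftUnderLoadMs(241.9, 100).toFixed(0)); // 12216
// The two approaches agree up to floating-point rounding.
console.log(Math.abs(ttftUnderLoadMs(241.9, 10) - meanCompletionMs(241.9, 10)) < 1e-6); // true
```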

Step 10: TTFT Breach Point

What: Maximum concurrent users before average TTFT exceeds the SLA.

Formula (solving Step 9 for C):

single_prefill × (C + 1) / 2 = ttft_sla_ms
C = 2 × ttft_sla_ms / single_prefill - 1

Calculation (5 second SLA, 4K prompt):

C = 2 × 5000 / 241.9 - 1 = 41.3 - 1 = 40.3 → 40 users

At 40 concurrent users the average TTFT is about 4.96 seconds; a 41st user pushes it to about 5.08 seconds, past the SLA.

Important: Adding NAND does NOT change this number. NAND parks cold sessions but doesn’t add TFLOPS. The prefill queue bottleneck is compute, not memory.

Code: ttftBreachPoint() in latency-estimator.ts
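The sketch below solves for C and then evaluates the average TTFT on either side of the solution. Rounding down to a whole user count is this sketch's choice so that the returned value still meets the SLA; the library's ttftBreachPoint() may round differently, and both function names here are hypothetical.

```typescript
// Average TTFT with C users, per Step 9.
function avgTtftMs(singlePrefillMs: number, users: number): number {
  return (singlePrefillMs * (users + 1)) / 2;
}

// Largest whole user count whose average TTFT is still at or under the SLA.
function breachPointSketch(singlePrefillMs: number, ttftSlaMs: number): number {
  const exactUsers = (2 * ttftSlaMs) / singlePrefillMs - 1;
  return Math.max(0, Math.floor(exactUsers));
}

const breachUsers = breachPointSketch(241.9, 5_000);
console.log(breachUsers);                                   // 40
console.log(avgTtftMs(241.9, breachUsers).toFixed(0));      // 4959
console.log(avgTtftMs(241.9, breachUsers + 1).toFixed(0));  // 5080
```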

Step 11: Restore Latency

What: Time to wake up a cold session from NAND SSD.

Formula:

per_gpu_kv = kv_per_session / gpu_count   (tensor parallel sharding)
restore_ms = (per_gpu_kv / nand_bandwidth) × 1000

Each GPU restores its own shard in parallel, so the KV per session is divided by GPU count.

Calculation (16K context fp8, 8× GPU, Gen4 NVMe 7 GB/s):

kv_per_session = 2.5 GB
per_gpu_kv     = 2.5 / 8 = 0.3125 GB
restore_ms     = (0.3125 / 7) × 1000 = 44.6 ms

With parallel restore streams (4 sessions restoring simultaneously):

effective_bw = 7 / 4 = 1.75 GB/s per stream
restore_ms   = (0.3125 / 1.75) × 1000 = 178.6 ms per session

Code: restoreLatency() in latency-estimator.ts
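Both restore calculations fit in one function with an optional stream count. The name `restoreMsSketch` is hypothetical, and the even bandwidth split across parallel streams is the same simplification the formula above makes.

```typescript
// Hypothetical helper: time to reload one session's KV shard from NAND, in ms.
function restoreMsSketch(
  kvPerSessionGb: number,
  gpuCount: number,
  nandBandwidthGBs: number,
  parallelStreams: number = 1,
): number {
  const perGpuKvGb = kvPerSessionGb / gpuCount;                // tensor-parallel shard
  const effectiveBwGBs = nandBandwidthGBs / parallelStreams;   // streams share the SSD
  return (perGpuKvGb / effectiveBwGBs) * 1000;
}

console.log(restoreMsSketch(2.5, 8, 7).toFixed(1));     // 44.6
console.log(restoreMsSketch(2.5, 8, 7, 4).toFixed(1));  // 178.6
console.log(restoreMsSketch(2.5, 8, 14).toFixed(1));    // 22.3 (Gen5 NVMe)
```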

Step 12: Monthly GPU Cost

What: Infrastructure cost estimate.

Formula:

monthly_cost = gpu_count × price_per_hour × 730 hours/month

Calculation (8× RTX A5000 on-demand at $1.10/hr):

monthly_cost = 8 × $1.10 × 730 = $6,424/month

Per-slot cost:

cost_per_slot_per_day = $6,424 / 56 slots / 30 days = $3.82/day

Code: monthlyGpuCost() in infra-cost.ts
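A sketch of both cost figures, using hypothetical helper names and the 730 hours per month and 30 days per month conventions from this step:

```typescript
const HOURS_PER_MONTH = 730;

// Hypothetical helper: on-demand GPU cost per month in USD.
function monthlyGpuCostSketch(gpuCount: number, pricePerHourUsd: number): number {
  return gpuCount * pricePerHourUsd * HOURS_PER_MONTH;
}

// Hypothetical helper: cost of one HBM slot per day.
function costPerSlotPerDay(monthlyCostUsd: number, hbmSlots: number, daysPerMonth: number = 30): number {
  return monthlyCostUsd / hbmSlots / daysPerMonth;
}

const monthlyCost = monthlyGpuCostSketch(8, 1.1);
console.log(monthlyCost.toFixed(0));                        // 6424
console.log(costPerSlotPerDay(monthlyCost, 56).toFixed(2)); // 3.82
```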

Step 13: Weighted Average Context (Workload Mix)

What: Converts the session type distribution into a single average context length.

Formula:

avg_context = Σ(count_i × midpoint_i) / Σ(count_i)

Midpoints: light=35K, medium=130K, heavy=325K, extreme=1,250K

Calculation (extreme=1, heavy=2, medium=3, light=4):

avg = (1×1,250,000 + 2×325,000 + 3×130,000 + 4×35,000) / (1+2+3+4)
    = (1,250,000 + 650,000 + 390,000 + 140,000) / 10
    = 2,430,000 / 10
    = 243,000 tokens

This weighted average drives all the session slot calculations in planCapacity().

Code: weightedAvgContext() in capacity-planner.ts
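The weighted average can be sketched as below. The per-category token values are the ones listed in this step; the names `WorkloadMixCounts`, `CATEGORY_TOKENS`, and `weightedAvgContextSketch` are hypothetical, and returning 0 for an empty mix is an assumption.

```typescript
type MixCategory = "light" | "medium" | "heavy" | "extreme";
type WorkloadMixCounts = Record<MixCategory, number>;

// Representative context length per category, as listed in Step 13.
const CATEGORY_TOKENS: Record<MixCategory, number> = {
  light: 35_000,
  medium: 130_000,
  heavy: 325_000,
  extreme: 1_250_000,
};

// Hypothetical helper: session-count-weighted average context length.
function weightedAvgContextSketch(mix: WorkloadMixCounts): number {
  const categories: MixCategory[] = ["light", "medium", "heavy", "extreme"];
  let weightedTokens = 0;
  let sessionCount = 0;
  for (const category of categories) {
    weightedTokens += mix[category] * CATEGORY_TOKENS[category];
    sessionCount += mix[category];
  }
  return sessionCount === 0 ? 0 : weightedTokens / sessionCount; // assumption: empty mix gives 0
}

console.log(weightedAvgContextSketch({ extreme: 1, heavy: 2, medium: 3, light: 4 })); // 243000
```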

Full Worked Example Summary

Config: Llama 3.1 70B, 8× RTX A5000, int4 weights, fp8 KV, 16K avg context

Step	Calculation	Result
KV/token	2 × 80 × 8 × 128 × 1	160 KB
KV/session (16K)	16,384 × 160 KB	2.5 GB
Weights (int4)	140 × 0.25	35 GB
Total HBM	22.5 × 8	180 GB
Free for KV	180 - 35 - 5	140 GB
HBM slots	floor(140 / 2.5)	56
NAND slots (4TB/GPU)	floor(32,000 / 2.5)	12,800
TPOT (1 user)	2.68 GB / 6.14 TB/s	0.44 ms
Single prefill (4K)	44T flops / 182T flops/s	241.9 ms
TTFT (10 users)	241.9 × 5.5	1,330 ms
TTFT breach (5s SLA)	floor(2×5000/241.9 - 1)	40 users
Restore (Gen4 NVMe)	0.3125 GB / 7 GB/s	44.6 ms
Monthly cost	8 × $1.10 × 730	$6,424

The CapacityPlan Object

The planCapacity() function returns a complete CapacityPlan with every metric:

Field	Type	Description
`model`	`ModelArchitecture`	The model being planned for
`hardware`	`HardwareConfig`	GPU setup (type, count, NAND)
`kvPrecision`	`KvPrecision`	KV cache precision used
`weightPrecision`	`WeightPrecision`	Weight quantization used
`totalHbmGb`	`number`	Total GPU memory across all GPUs
`weightMemoryGb`	`number`	Memory consumed by model weights
`freeHbmForKvGb`	`number`	HBM available for KV cache after weights + overhead
`kvBytesPerToken`	`number`	Bytes per token in KV cache
`hbmSlots`	`number`	Concurrent sessions fitting in HBM
`nandSlots`	`number`	Sessions parkable on NAND SSD
`totalSessions`	`number`	hbmSlots + nandSlots
`tpotMs`	`number`	Estimated Time Per Output Token (ms)
`ttftMs`	`number`	Estimated Time To First Token (ms)
`restoreLatencyMs`	`number \| null`	NAND → HBM restore time (null if no NAND)
`ttftBreachPoint`	`number`	Max concurrent users before 5s TTFT SLA breach
`monthlyGpuCostUsd`	`number`	Estimated monthly GPU infrastructure cost

Interactive Dashboard

The apps/capacity-planner/ Next.js app provides a full interactive UI with:

Model selector (all 15 architectures)
GPU type, count, NAND per GPU sliders
KV/weight precision selectors
Workload controls (avg context, cold ratio, SLA thresholds)
Per-GPU breakdown panel (shows free HBM + NAND per card)
6 interactive charts:
- Users vs GPUs — session capacity scaling with GPU count
- Users vs Context — how capacity drops as context grows
- TPOT vs Users — decode latency at different context sizes
- TTFT vs Users — prefill queue congestion with SLA breach markers
- Restore Budget — NAND restore time at Gen4/Gen5 bandwidth
- GPU vs NAND — total sessions across NAND sizes

cd apps/capacity-planner
npm install
npm run dev
# → http://localhost:3000

Related

KV Estimator: kvBytesPerToken, kvCacheForContext, maxContextForMemory, weightMemory
Capacity Planner: planCapacity, maxConcurrentSessions, estimateGpuCount
Latency Estimator: estimateTpot, estimateTtft, ttftBreachPoint, restoreLatency
Session Profiler: Runtime monitoring with EventBus integration
Edge GPU Monitoring: Real-time GPU memory, utilization, and temperature via nvidia-smi
Observability Metrics: Prometheus counters for KV cache and session categories
Examples: Live monitor, config comparison, KV sizing
