The brief: build a self-hosted inference service that serves Llama 3.3 70B to 200 concurrent requests with end-to-end p95 < 4 seconds for a typical generation of 500 input + 200 output tokens. The service must expose an OpenAI-compatible API so existing client SDKs work unchanged, support streaming, and autoscale to absorb 4× bursts without breaking SLO.
Self-hosted LLM serving is a memory-bandwidth problem dressed up as a compute problem. The decisions that actually matter are quantization, K/V cache sizing, batching strategy, and admission control — in that order. GPU choice follows from those. This walk-through is opinionated about all of them.
The service exposes POST /v1/chat/completions, compatible with the OpenAI SDK, so applications written against OpenAI work unchanged.

| Metric | Target |
|---|---|
| Time-to-first-token (TTFT) p95 | 600 ms |
| Per-token latency (TPOT) p95 | 40 ms (= 25 tok/s/user) |
| End-to-end p95 (500 in + 200 out) | 4,000 ms |
| Concurrent requests | 200 sustained, 800 burst |
| Aggregate token throughput | 5,000 tok/sec sustained |
| Availability | 99.9% per region |
| GPU cost target | < $1.50 per 1M tokens |
The two metrics that matter for user-perceived quality are TTFT (time until the first token streams) and TPOT (time between subsequent tokens). End-to-end latency is a derived metric — under load, TPOT degrades faster than TTFT, so optimize batching with TPOT as the target.
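A minimal sketch of that derivation for a single streamed request (the 300 ms and 18 ms figures below are illustrative median values for a lightly loaded node, not measured targets):

```python
def e2e_ms(ttft_ms: float, tpot_ms: float, output_tokens: int) -> float:
    """End-to-end latency for one streamed request: time to first token,
    plus per-token latency for each of the remaining tokens."""
    return ttft_ms + tpot_ms * (output_tokens - 1)

# Illustrative median values (assumed): 300 ms TTFT, 18 ms TPOT, 200 output tokens
print(e2e_ms(300, 18, 200))  # 3882 ms -- inside the 4,000 ms budget
```

Note that the p95 targets do not simply add: a request rarely hits p95 TTFT and p95 TPOT at the same time, which is why 600 ms + 40 ms × 200 can exceed 4,000 ms without contradicting the table.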
Model weights memory. Llama 3.3 70B in FP8 is ~70 GB. Add ~5 GB for activations and runtime overhead. Single H100 80GB fits with ~5 GB for K/V cache — too tight. Use 2×H100 with tensor parallelism, or single H100 with FP4 / AWQ-INT4 (which we will not pick because of quality loss on long contexts).
K/V cache per token. Llama 3.3 70B uses GQA with 8 K/V heads, hidden dim 8192 split across 64 attention heads (so 128 dim per head), 80 layers. K/V cache per token in FP8:
# K/V cache size per token, per layer
# = 2 (for K and V) * num_kv_heads * head_dim * dtype_bytes
kv_per_token_per_layer = 2 * 8 * 128 * 1 # FP8 = 1 byte
# Total across all 80 layers
kv_per_token = kv_per_token_per_layer * 80
# = 163,840 bytes = 160 KB per token
# At 4096 token avg context (input + output):
kv_per_request = 160 * 1024 * 4096 # = 640 MB per concurrent request
# K/V cache pool needed for 200 concurrent:
total_kv = 640 * 200 # = 128 GB
This is the critical sizing constraint. With 2×H100 (160 GB total) - 70 GB weights = 90 GB free, K/V cache holds < 150 concurrent at full context. To hit 200 concurrent we need either: (a) longer context per fewer requests, (b) 4×H100 with TP=4 (much more headroom), (c) PagedAttention (vLLM) which dramatically improves K/V cache utilization by avoiding fragmentation, typically buying ~2× effective concurrency on the same hardware.
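The concurrency ceiling above can be checked in a few lines (the ~2× PagedAttention uplift is the rough estimate from the text, not derived here):

```python
def max_concurrency(total_gpu_gb: int, weights_gb: int, kv_per_request_mb: int) -> int:
    """How many full-context requests fit in the K/V cache pool left
    after model weights are loaded."""
    free_mb = (total_gpu_gb - weights_gb) * 1024
    return free_mb // kv_per_request_mb

# 2xH100 = 160 GB total, 70 GB weights, 640 MB K/V per request at 4096 tokens
naive = max_concurrency(160, 70, 640)   # 144 -> the "< 150 concurrent" above
paged = naive * 2                       # rough PagedAttention uplift (estimate)
print(naive, paged)                     # 144 288
```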
Throughput math. A single 2×H100 node running vLLM with Llama 3.3 70B FP8 yields ~3,500 tokens/sec aggregate at batch ~32 (measured; varies with input length). For 5,000 tok/sec sustained with headroom:
target_throughput: 5000 tok/sec
node_throughput: 3500 tok/sec # 2xH100, vLLM, FP8, batch=32
nodes_for_steady_state: 2 (rounded up from 1.43)
warm_spare_per_az: 1
total_nodes: 2 + 1 = 3 nodes (= 6 H100s)
burst_capacity: additional 2 nodes via autoscale on warm pool (2 min cold start)
Clients (OpenAI SDK, langchain, custom)
|
v
+---------------------------+
| L7 Load Balancer (Envoy) | TLS, JWT, OpenAI-compatible routing
+-------------+-------------+
|
v
+---------------------------+
| API Gateway | per-tenant rate limit, admission control,
| (FastAPI / Go) | queue-depth aware, OpenAI schema validation
+-------------+-------------+
|
sticky-by-conversation
|
+--------+--------+----------+
| | |
v v v
+---------+ +---------+ +---------+
| Node 1 | | Node 2 | | Node 3 | each = 2xH100, vLLM, Llama 70B FP8
| vLLM | | vLLM | | warm |
| TP=2 | | TP=2 | | spare |
+---------+ +---------+ +---------+
\ | /
+-------+-------+-------+------+
|
v
+---------------------------+
| Telemetry sink | token throughput, queue latency,
| (OTel + Prometheus) | cache hit rate, GPU mem util
+---------------------------+
Out-of-band:
- HF model registry ----> object store (S3) ----> node startup hydration
- Autoscaler watches queue_depth; scales between 2-8 nodes
Component responsibilities:
Gateway: accepts POST /v1/chat/completions and routes requests to the engine nodes.

Engine: vLLM. Continuous batching, PagedAttention, broad model support, the most mature OSS option. TGI is a good alternative with similar features and HF's backing. SGLang has a faster prefix-caching implementation that wins on multi-turn workloads, but the ecosystem is younger. We pick vLLM for this project.
Quantization: FP8. On H100, FP8 has hardware acceleration via the Transformer Engine; the quality loss vs FP16 is < 1% on typical chat benchmarks, and the throughput gain is ~30%. AWQ-INT4 squeezes the model into a single H100 at the cost of 3–5% quality and notably worse long-context behavior. FP8 is the right default. AWQ-INT4 is reserved for budget tiers where quality is explicitly downgraded.
Continuous batching parameters.
# vllm serve config
model: meta-llama/Llama-3.3-70B-Instruct
tensor-parallel-size: 2
quantization: fp8
kv-cache-dtype: fp8
# Continuous batching: vLLM picks up new requests at every decode step,
# avoiding the head-of-line blocking that plagues static batching.
max-num-seqs: 64 # concurrent active sequences in the batch
max-num-batched-tokens: 8192 # token budget per scheduling step
max-model-len: 8192 # context window cap (matches our deployed prompts)
# Memory: leave 5GB headroom for activations
gpu-memory-utilization: 0.92
swap-space: 4 # GB CPU swap for evicted K/V (used rarely)
# Scheduler tuning
enable-prefix-caching: true # huge win on shared system prompts
enable-chunked-prefill: true # avoid TTFT spikes from long prompts
preemption-mode: recompute # prefer recompute over swap on K/V eviction
Why these numbers. max-num-seqs=64 is bounded by
K/V cache memory (above this, we evict and TPOT spikes). max-num-batched-tokens=8192
is the throughput-vs-latency knob: larger means higher throughput but worse
TTFT for late arrivals. Prefix caching pays off massively when 95% of requests
share an identical 200-token system prompt — TTFT drops from 300 ms to
~50 ms because the K/V for the prefix is already computed.
Speculative decoding. Use a smaller draft model (Llama 3.2 1B or 3B) to predict 4–8 tokens at a time, verified by the 70B in parallel. It roughly doubles TPOT throughput at the cost of additional GPU memory per node for the draft model. We enable it after the basic system is stable, not on day one.
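The expected gain can be sketched with the standard acceptance-rate model: if each draft token is accepted independently with probability alpha (an idealizing assumption) and the draft proposes k tokens per step, the expected tokens emitted per verification step is a geometric series.

```python
def expected_tokens_per_step(alpha: float, k: int) -> float:
    """Expected tokens emitted per target-model verification step, assuming
    i.i.d. acceptance with probability alpha for each of k draft tokens:
    1 + alpha + alpha^2 + ... + alpha^k = (1 - alpha^(k+1)) / (1 - alpha)."""
    return (1 - alpha ** (k + 1)) / (1 - alpha)

# With an assumed 80% acceptance rate and 4 draft tokens:
print(round(expected_tokens_per_step(0.8, 4), 2))  # 3.36 tokens per step
```

Even after verification overhead eats part of that 3.36×, a net ~2× TPOT gain is plausible, consistent with the claim above.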
import time
from fastapi import Depends, HTTPException
from fastapi.responses import StreamingResponse

@app.post("/v1/chat/completions")
async def chat_completions(req: ChatCompletionRequest, tenant: Tenant = Depends(auth)):
    request_start = time.monotonic()
    # 1. Quota check (~2 ms, in-process token bucket reconciled every 100 ms)
    if not tenant.bucket.try_acquire(estimate_tokens(req)):
        raise HTTPException(429, headers={"Retry-After": "1"})
    # 2. Admission control: refuse if the cluster is overloaded
    if cluster.queue_depth() > ADMISSION_LIMIT:
        raise HTTPException(503, "queue full, retry shortly")
    # 3. Pick a node weighted by current queue_depth (least-loaded)
    node = router.pick(cluster.healthy_nodes())
    # 4. Forward to vLLM, stream tokens back as SSE
    async def stream():
        first_token_time = None
        async for chunk in node.client.chat.completions.create(**req.dict(), stream=True):
            if first_token_time is None:
                first_token_time = time.monotonic()
                metrics.ttft.observe(first_token_time - request_start)
            yield f"data: {chunk.json()}\n\n"
        yield "data: [DONE]\n\n"
    return StreamingResponse(stream(), media_type="text/event-stream")
Inside the chosen node, vLLM's continuous batcher does the real work, governed by max-num-seqs and max-num-batched-tokens.

Autoscaling signal: queue depth, not GPU utilization. GPU utilization is a misleading metric for batched LLM serving — it can be 80% on perfectly happy traffic or on traffic that's about to violate SLO. The right signal is the scheduler queue depth exposed by vLLM:
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
name: vllm-llama-70b
spec:
scaleTargetRef:
name: vllm-deployment
minReplicaCount: 2 # always keep 2 nodes hot
maxReplicaCount: 8 # cost ceiling
cooldownPeriod: 300 # don't scale down for 5 min after scale up
triggers:
- type: prometheus
metadata:
serverAddress: http://prometheus:9090
metricName: vllm_queue_depth
threshold: "20" # scale up when avg queue per node > 20
query: avg(vllm_num_requests_waiting) by (deployment)
Warm pool. Cold start of a new vLLM node = 90 s download + 60 s model load = ~150 s. Far too slow for burst absorption. We keep one "warm spare" node per AZ — model loaded, registered with the load balancer but excluded from routing. On scale-up trigger, the spare gets included in routing instantly (~5 s); a new actual cold node starts in the background to replenish the spare.
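A minimal sketch of the scale-up path, with hypothetical `router` and `provisioner` interfaces (none of these names come from a real library):

```python
import threading

def on_scale_up(router, warm_spare, provisioner):
    """Hypothetical scale-up handler: promote the warm spare into routing
    immediately (~5 s), then replenish it with a cold start in the background."""
    router.include(warm_spare)   # spare already has the model loaded
    t = threading.Thread(target=provisioner.start_cold_node, daemon=True)
    t.start()                    # ~150 s: weight download + model load
    return t
```

The key property is that user traffic never waits on the ~150 s cold start; only the spare pool does.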
Admission control. When queue_depth across the cluster
exceeds a threshold, the gateway returns 503 with a small Retry-After.
Refusing fast is dramatically better than queueing for 30 s and then timing
out client-side; the 503 lets the caller back off and the cluster catch up.
Bottlenecks at scale, in order:
Failure handling. On an aborted generation the engine marks RequestOutput.finished = "abort"; the gateway returns 500 to the client and tracks the prompt for repro. We do not crash the whole node.

Idempotency. The OpenAI API is request/response, so callers that retry can produce duplicate generations. We support an optional X-Request-Id header; the gateway dedupes via Redis with a 5-minute TTL. The full streamed response is cached for replay; rare in practice but cheap to support.
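The dedupe logic can be sketched with an in-process TTL cache standing in for Redis (illustrative only; production uses Redis so all gateway pods share the cache):

```python
import time

class ReplayCache:
    """In-process stand-in for the Redis dedupe described above:
    maps X-Request-Id -> full streamed response, expiring after ttl_s seconds."""
    def __init__(self, ttl_s: float = 300.0):
        self.ttl_s = ttl_s
        self._store = {}   # request_id -> (expires_at, response)

    def get(self, request_id: str):
        entry = self._store.get(request_id)
        if entry and entry[0] > time.monotonic():
            return entry[1]                # replay the cached response
        self._store.pop(request_id, None)  # expired or absent
        return None

    def put(self, request_id: str, response: str):
        self._store[request_id] = (time.monotonic() + self.ttl_s, response)

cache = ReplayCache()
cache.put("req-123", "data: ...\ndata: [DONE]\n")
print(cache.get("req-123") is not None)  # True
```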
Per 1M tokens generated, baseline 3 nodes (= 6 H100s):
h100_lease: $2.00/hr each
nodes_baseline: 3 (= 6 H100s = $12/hr)
hourly_cost: $12
sustained_throughput: 5,000 tok/sec = 18M tok/hr
cost_per_1M_tokens: $12 / 18 = $0.67
# add overhead
gateway_compute: ~$0.20 per 1M
storage_egress: ~$0.10
observability: ~$0.10
---
total_per_1M: ~$1.07
target: < $1.50 -- well under
# compare bedrock haiku at $0.25 in / $1.25 out:
# blended ~$0.50 per 1M tokens depending on input/output ratio
# verdict: self-hosted Llama 70B is ~2x cost of Haiku but
# delivers larger model + full data sovereignty
The economic case for self-hosted Llama 70B is not "cheaper per token" — it usually isn't, vs Haiku-class managed APIs. The case is: data sovereignty, predictable cost (no surprise bills on usage spikes), control over model upgrades, and access to the larger 70B class without paying Sonnet pricing.
vLLM vs TGI vs SGLang. vLLM is the safe production choice: biggest community, most mature scheduler, OpenAI-compatible out of the box. TGI matches it on most workloads with cleaner ops. SGLang is the throughput king for multi-turn (radix-tree prefix cache is genuinely faster) but production tooling is younger. I'd ship vLLM, evaluate SGLang on the next quarter's benchmark.
FP8 vs AWQ-INT4 vs FP16. FP8 wins on H100: hardware-accelerated matmul, < 1% quality loss, 30% throughput gain over FP16. AWQ-INT4 lets you fit 70B on a single H100 but loses 3–5% on long contexts — bad trade for our use case. FP16 is the reference quality but needs the same 2×H100 footprint without the throughput gain. FP8 is the right default.
H100 vs L40S vs A100 vs MI300X. H100 is the production-safe pick: best raw perf, hardware FP8, mature drivers. L40S is dramatically cheaper (~$1/hr) but throughput is ~40% of H100 and FP8 is software-emulated; viable for a 13B model, painful for 70B. A100 is fine if you can get it cheap, but no FP8 hardware support means you give up the 30% throughput. MI300X has more HBM and competitive specs but the ROCm-vLLM port is still maturing; I would not bet a production launch on it today.
Continuous batching vs static batching vs request-batching. Continuous batching is strictly better for streaming chat workloads — no head-of-line blocking, every decode step is fully utilized. Static batching is only sane for batch-job offline scoring. Don't ship anything else for an interactive service.
Self-hosted vs Bedrock/Together/Anyscale. Managed providers are cheaper to operate by a wide margin if you don't need data sovereignty. The honest answer for most companies is "use Bedrock for the easy 90%, run self-hosted for the 10% that has data residency or compliance constraints." Pretending self-hosted will save money on token cost alone is the most common mistake; the savings come from compliance, predictability, and avoiding provider lock-in — not the $/token rate.
Speculative decoding now or later. Speculative decoding adds operational complexity (a second model to load, more GPU memory, more ways to fail). It can roughly 2× TPOT throughput when the draft model is well-aligned. Ship it after the base system is stable for ~30 days; the gain is real but not worth the launch risk.
GPU utilization measures whether the SMs are busy, not whether the system is meeting SLO. A vLLM node can run at 80% GPU util while p95 TPOT is fine, or at 80% util while every request is queueing 30 s. Queue depth measures the actual user-facing problem: how many requests are waiting for a decode step. Scaling on queue depth means we add capacity when users are about to feel slowness, which is the only useful definition of "load."
Naive K/V cache allocates a contiguous block per request sized for the max sequence length. Most requests don't reach max length, so most of that block is wasted — classic internal fragmentation. PagedAttention treats the K/V cache as a pool of fixed-size "pages" (typically 16 tokens each) and allocates pages to requests on demand. Memory utilization goes from ~30% to ~95% for typical chat workloads, which roughly translates to 2× effective concurrency on the same GPU.
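The utilization gap can be made concrete with a toy batch (sequence lengths below are hypothetical; 160 KB/token and 16-token pages come from the sizing above):

```python
import math

def contiguous_alloc_tokens(seq_lens, max_len):
    # Naive: every request reserves max_len token slots up front
    return len(seq_lens) * max_len

def paged_alloc_tokens(seq_lens, page_tokens):
    # Paged: each request holds ceil(len / page_tokens) fixed-size pages
    return sum(math.ceil(l / page_tokens) for l in seq_lens) * page_tokens

seqs = [700, 1200, 450, 2048, 900]      # hypothetical chat batch, 8192 cap
used = sum(seqs)
naive = contiguous_alloc_tokens(seqs, 8192)
paged = paged_alloc_tokens(seqs, 16)
print(round(used / naive, 2), round(used / paged, 2))  # 0.13 0.99
```

Per-token bytes cancel out of the ratio, so utilization depends only on sequence lengths; 13% vs 99% here is an extreme but representative illustration of the ~30% vs ~95% figures above.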
Two layers. The gateway caps max_tokens per tenant tier; an
SMB tenant might be limited to 1024, an enterprise tenant to 4096. The vLLM
node enforces a global hard cap via max-model-len — even
if a request slips through asking for 16k, the engine rejects it rather than
silently overrunning the context. Long-output
requests still consume K/V cache for their full duration; they get scheduled
against the same batch as short requests, so the impact is bounded. We do
not pre-allocate based on max_tokens.
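The two-layer cap reduces to one clamp in the gateway (function and tier names are illustrative):

```python
def effective_max_tokens(requested: int, tier_cap: int,
                         prompt_len: int, model_len: int = 8192) -> int:
    """Clamp a request's max_tokens to the tenant tier cap and to what
    actually fits in the remaining context window."""
    return max(0, min(requested, tier_cap, model_len - prompt_len))

print(effective_max_tokens(16_000, 1024, 500))   # SMB tier: 1024
print(effective_max_tokens(16_000, 4096, 7000))  # enterprise, long prompt: 1192
```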
Run the new model on a separate node pool labeled model-version=v2;
the gateway routes a configurable percentage of requests there based on
tenant or sticky session. Telemetry is tagged with the version so dashboards
compare TTFT, TPOT, and quality metrics. Promote v2 by flipping the routing
percentage to 100% and decommissioning the v1 pool. Hot-swap inside a single
vLLM node is technically possible but risky — we prefer separate pools
because rollback is just a routing change.
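The sticky routing decision can be sketched as a deterministic hash of the tenant id (a minimal sketch; the real gateway would also consult per-tenant overrides):

```python
import hashlib

def route_version(tenant_id: str, canary_pct: int) -> str:
    """Hash the tenant id into [0, 100) and send that slice to the v2 pool.
    Deterministic, so a tenant never flaps between model versions."""
    bucket = int(hashlib.sha256(tenant_id.encode()).hexdigest(), 16) % 100
    return "model-version=v2" if bucket < canary_pct else "model-version=v1"

print(route_version("tenant-42", 10))  # stable result for this tenant
```

Promoting v2 is then literally `canary_pct = 100`, and rollback is `canary_pct = 0`.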
First, check whether spikes correlate with prefill chunks being scheduled
— long prompts entering the batch can briefly delay decode for everyone.
The fix is enable-chunked-prefill: true which we already have
on. Second, check K/V cache occupancy at the spike: if the cache is full,
new requests are waiting for slots, not for compute. Third, check whether
the spike correlates with cold starts (autoscaler added a node mid-burst);
the new node accepted traffic before model load completed. Each has a
distinct fingerprint in the metrics; the right tool is grouped traces tagged
by tenant, prompt length, and node ID.
(1) The client SDK sends POST /v1/chat/completions with stream=true to our LB.
(2) Envoy validates TLS, extracts the JWT, and routes to an API Gateway pod.
(3) The gateway authorizes the tenant, checks the per-tenant quota bucket, and checks cluster admission.
(4) The gateway picks a vLLM node by least-loaded (free K/V slots, not request count).
(5) The gateway opens a streaming connection to vLLM's OpenAI endpoint and starts forwarding the SSE stream back to the client.
(6) On the vLLM node, the request enters the scheduler. If a prefix-cache hit exists for the system prompt (likely), only the new tokens need prefill.
(7) Chunked prefill processes the input over a few decode steps.
(8) Once prefill finishes, the first decoded token is generated and immediately streamed to the gateway, which streams it to the client.
(9) End-to-end TTFT: ~300 ms for short prompts, ~600 ms for long prompts — at the 600 ms p95 target.
(10) Subsequent tokens stream at ~25 tok/sec until EOS, with TPOT under 40 ms.