The brief: build a self-hosted inference service that serves Llama 3.3 70B to 200 concurrent requests with end-to-end p95 < 4 seconds for a typical generation of 500 input + 200 output tokens. The service must expose an OpenAI-compatible API so existing client SDKs work unchanged, support streaming, and autoscale to absorb 4× bursts without breaking SLO.
Self-hosted LLM serving is a memory-bandwidth problem dressed up as a compute problem. The decisions that actually matter are quantization, K/V cache sizing, batching strategy, and admission control — in that order. GPU choice follows from those. This walk-through is opinionated about all of them.
The service exposes POST /v1/chat/completions, compatible with the OpenAI SDK, so applications written against OpenAI work unchanged.

| Metric | Target |
|---|---|
| Time-to-first-token (TTFT) p95 | 600 ms |
| Per-token latency (TPOT) p95 | 40 ms (= 25 tok/s/user) |
| End-to-end p95 (500 in + 200 out) | 4,000 ms |
| Concurrent requests | 200 sustained, 800 burst |
| Aggregate token throughput | 5,000 tok/sec sustained |
| Availability | 99.9% per region |
| GPU cost target | < $1.50 per 1M tokens |
The two metrics that matter for user-perceived quality are TTFT (time until the first token streams) and TPOT (time between subsequent tokens). End-to-end latency is a derived metric — under load, TPOT degrades faster than TTFT, so optimize batching with TPOT as the target.
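A minimal sketch of that derivation for a single streamed request (the 300 ms and 18 ms figures below are illustrative median values for a lightly loaded node, not measured targets):

```python
def e2e_ms(ttft_ms: float, tpot_ms: float, output_tokens: int) -> float:
    """End-to-end latency for one streamed request: time to first token,
    plus per-token latency for each of the remaining tokens."""
    return ttft_ms + tpot_ms * (output_tokens - 1)

# Illustrative median values (assumed): 300 ms TTFT, 18 ms TPOT, 200 output tokens
print(e2e_ms(300, 18, 200))  # 3882 ms -- inside the 4,000 ms budget
```

Note that the p95 targets do not simply add: a request rarely hits p95 TTFT and p95 TPOT at the same time, which is why 600 ms + 40 ms × 200 can exceed 4,000 ms without contradicting the table.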
Model weights memory. Llama 3.3 70B in FP8 is ~70 GB. Add ~5 GB for activations and runtime overhead. Single H100 80GB fits with ~5 GB for K/V cache — too tight. Use 2×H100 with tensor parallelism, or single H100 with FP4 / AWQ-INT4 (which we will not pick because of quality loss on long contexts).
K/V cache per token. Llama 3.3 70B uses GQA with 8 K/V heads, hidden dim 8192 split across 64 attention heads (so 128 dim per head), 80 layers. K/V cache per token in FP8:
# K/V cache size per token, per layer
# = 2 (for K and V) * num_kv_heads * head_dim * dtype_bytes
kv_per_token_per_layer = 2 * 8 * 128 * 1 # FP8 = 1 byte
# Total across all 80 layers
kv_per_token = kv_per_token_per_layer * 80
# = 163,840 bytes = 160 KB per token
# At 4096 token avg context (input + output):
kv_per_request = 160 * 1024 * 4096 # = 640 MB per concurrent request
# K/V cache pool needed for 200 concurrent:
total_kv = 640 * 200 # = 128 GB
This is the critical sizing constraint. With 2×H100 (160 GB total) - 70 GB weights = 90 GB free, K/V cache holds < 150 concurrent at full context. To hit 200 concurrent we need either: (a) longer context per fewer requests, (b) 4×H100 with TP=4 (much more headroom), (c) PagedAttention (vLLM) which dramatically improves K/V cache utilization by avoiding fragmentation, typically buying ~2× effective concurrency on the same hardware.
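The concurrency ceiling above can be checked in a few lines (the ~2× PagedAttention uplift is the rough estimate from the text, not derived here):

```python
def max_concurrency(total_gpu_gb: int, weights_gb: int, kv_per_request_mb: int) -> int:
    """How many full-context requests fit in the K/V cache pool left
    after model weights are loaded."""
    free_mb = (total_gpu_gb - weights_gb) * 1024
    return free_mb // kv_per_request_mb

# 2xH100 = 160 GB total, 70 GB weights, 640 MB K/V per request at 4096 tokens
naive = max_concurrency(160, 70, 640)   # 144 -> the "< 150 concurrent" above
paged = naive * 2                       # rough PagedAttention uplift (estimate)
print(naive, paged)                     # 144 288
```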
Throughput math. A single 2×H100 node running vLLM with Llama 3.3 70B FP8 yields ~3,500 tokens/sec aggregate at batch ~32 (measured; varies with input length). For 5,000 tok/sec sustained with headroom:
target_throughput: 5000 tok/sec
node_throughput: 3500 tok/sec # 2xH100, vLLM, FP8, batch=32
nodes_for_steady_state: 2 (rounded up from 1.43)
warm_spare_per_az: 1
total_nodes: 2 + 1 = 3 nodes (= 6 H100s)
burst_capacity: additional 2 nodes via autoscale on warm pool (2 min cold start)
Clients (OpenAI SDK, langchain, custom)
|
v
+---------------------------+
| L7 Load Balancer (Envoy) | TLS, JWT, OpenAI-compatible routing
+-------------+-------------+
|
v
+---------------------------+
| API Gateway | per-tenant rate limit, admission control,
| (FastAPI / Go) | queue-depth aware, OpenAI schema validation
+-------------+-------------+
|
sticky-by-conversation
|
+--------+--------+----------+
| | |
v v v
+---------+ +---------+ +---------+
| Node 1 | | Node 2 | | Node 3 | each = 2xH100, vLLM, Llama 70B FP8
| vLLM | | vLLM | | warm |
| TP=2 | | TP=2 | | spare |
+---------+ +---------+ +---------+
\ | /
+-------+-------+-------+------+
|
v
+---------------------------+
| Telemetry sink | token throughput, queue latency,
| (OTel + Prometheus) | cache hit rate, GPU mem util
+---------------------------+
Out-of-band:
- HF model registry ----> object store (S3) ----> node startup hydration
- Autoscaler watches queue_depth; scales between 2-8 nodes
Component responsibilities:
Gateway: accepts POST /v1/chat/completions and routes requests to the engine nodes.

Engine: vLLM. Continuous batching, PagedAttention, broad model support, the most mature OSS option. TGI is a good alternative with similar features and HF's backing. SGLang has a faster prefix-caching implementation that wins on multi-turn workloads, but the ecosystem is younger. We pick vLLM for this project.
Quantization: FP8. On H100, FP8 has hardware acceleration via the Transformer Engine; the quality loss vs FP16 is < 1% on typical chat benchmarks, and the throughput gain is ~30%. AWQ-INT4 squeezes the model into a single H100 at the cost of 3–5% quality and notably worse long-context behavior. FP8 is the right default. AWQ-INT4 is reserved for budget tiers where quality is explicitly downgraded.
Continuous batching parameters.
# vllm serve config
model: meta-llama/Llama-3.3-70B-Instruct
tensor-parallel-size: 2
quantization: fp8
kv-cache-dtype: fp8
# Continuous batching: vLLM picks up new requests at every decode step,
# avoiding the head-of-line blocking that plagues static batching.
max-num-seqs: 64 # concurrent active sequences in the batch
max-num-batched-tokens: 8192 # token budget per scheduling step
max-model-len: 8192 # context window cap (matches our deployed prompts)
# Memory: leave 5GB headroom for activations
gpu-memory-utilization: 0.92
swap-space: 4 # GB CPU swap for evicted K/V (used rarely)
# Scheduler tuning
enable-prefix-caching: true # huge win on shared system prompts
enable-chunked-prefill: true # avoid TTFT spikes from long prompts
preemption-mode: recompute # prefer recompute over swap on K/V eviction
Why these numbers. max-num-seqs=64 is bounded by
K/V cache memory (above this, we evict and TPOT spikes). max-num-batched-tokens=8192
is the throughput-vs-latency knob: larger means higher throughput but worse
TTFT for late arrivals. Prefix caching pays off massively when 95% of requests
share an identical 200-token system prompt — TTFT drops from 300 ms to
~50 ms because the K/V for the prefix is already computed.
Speculative decoding. Use a smaller draft model (Llama 3.2 1B or 3B) to predict 4–8 tokens at a time, verified by the 70B in parallel. It roughly doubles TPOT throughput at the cost of additional GPU memory per node for the draft model. We enable it after the basic system is stable, not on day one.
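The expected gain can be sketched with the standard acceptance-rate model: if each draft token is accepted independently with probability alpha (an idealizing assumption) and the draft proposes k tokens per step, the expected tokens emitted per verification step is a geometric series.

```python
def expected_tokens_per_step(alpha: float, k: int) -> float:
    """Expected tokens emitted per target-model verification step, assuming
    i.i.d. acceptance with probability alpha for each of k draft tokens:
    1 + alpha + alpha^2 + ... + alpha^k = (1 - alpha^(k+1)) / (1 - alpha)."""
    return (1 - alpha ** (k + 1)) / (1 - alpha)

# With an assumed 80% acceptance rate and 4 draft tokens:
print(round(expected_tokens_per_step(0.8, 4), 2))  # 3.36 tokens per step
```

Even after verification overhead eats part of that 3.36×, a net ~2× TPOT gain is plausible, consistent with the claim above.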
import time
from fastapi import Depends, HTTPException
from fastapi.responses import StreamingResponse

@app.post("/v1/chat/completions")
async def chat_completions(req: ChatCompletionRequest, tenant: Tenant = Depends(auth)):
    request_start = time.monotonic()
    # 1. Quota check (~2 ms, in-process token bucket reconciled every 100 ms)
    if not tenant.bucket.try_acquire(estimate_tokens(req)):
        raise HTTPException(429, headers={"Retry-After": "1"})
    # 2. Admission control: refuse if the cluster is overloaded
    if cluster.queue_depth() > ADMISSION_LIMIT:
        raise HTTPException(503, "queue full, retry shortly")
    # 3. Pick a node weighted by current queue_depth (least-loaded)
    node = router.pick(cluster.healthy_nodes())
    # 4. Forward to vLLM, stream tokens back as SSE
    async def stream():
        first_token_time = None
        async for chunk in node.client.chat.completions.create(**req.dict(), stream=True):
            if first_token_time is None:
                first_token_time = time.monotonic()
                metrics.ttft.observe(first_token_time - request_start)
            yield f"data: {chunk.json()}\n\n"
        yield "data: [DONE]\n\n"
    return StreamingResponse(stream(), media_type="text/event-stream")
Inside the chosen node, vLLM's continuous batcher does the real work, governed by max-num-seqs and max-num-batched-tokens.

Autoscaling signal: queue depth, not GPU utilization. GPU utilization is a misleading metric for batched LLM serving — it can be 80% on perfectly happy traffic or on traffic that's about to violate SLO. The right signal is the scheduler queue depth exposed by vLLM:
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
name: vllm-llama-70b
spec:
scaleTargetRef:
name: vllm-deployment
minReplicaCount: 2 # always keep 2 nodes hot
maxReplicaCount: 8 # cost ceiling
cooldownPeriod: 300 # don't scale down for 5 min after scale up
triggers:
- type: prometheus
metadata:
serverAddress: http://prometheus:9090
metricName: vllm_queue_depth
threshold: "20" # scale up when avg queue per node > 20
query: avg(vllm_num_requests_waiting) by (deployment)
Warm pool. Cold start of a new vLLM node = 90 s download + 60 s model load = ~150 s. Far too slow for burst absorption. We keep one "warm spare" node per AZ — model loaded, registered with the load balancer but excluded from routing. On scale-up trigger, the spare gets included in routing instantly (~5 s); a new actual cold node starts in the background to replenish the spare.
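A minimal sketch of the scale-up path, with hypothetical `router` and `provisioner` interfaces (none of these names come from a real library):

```python
import threading

def on_scale_up(router, warm_spare, provisioner):
    """Hypothetical scale-up handler: promote the warm spare into routing
    immediately (~5 s), then replenish it with a cold start in the background."""
    router.include(warm_spare)   # spare already has the model loaded
    t = threading.Thread(target=provisioner.start_cold_node, daemon=True)
    t.start()                    # ~150 s: weight download + model load
    return t
```

The key property is that user traffic never waits on the ~150 s cold start; only the spare pool does.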
Admission control. When queue_depth across the cluster
exceeds a threshold, the gateway returns 503 with a small Retry-After.
Refusing fast is dramatically better than queueing for 30 s and then timing
out client-side; the 503 lets the caller back off and the cluster catch up.
Bottlenecks at scale, in order:
Failure handling. On an aborted generation the engine marks RequestOutput.finished = "abort"; the gateway returns 500 to the client and tracks the prompt for repro. We do not crash the whole node.

Idempotency. The OpenAI API is request/response, so callers that retry can produce duplicate generations. We support an optional X-Request-Id header; the gateway dedupes via Redis with a 5-minute TTL. The full streamed response is cached for replay; rare in practice but cheap to support.
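The dedupe logic can be sketched with an in-process TTL cache standing in for Redis (illustrative only; production uses Redis so all gateway pods share the cache):

```python
import time

class ReplayCache:
    """In-process stand-in for the Redis dedupe described above:
    maps X-Request-Id -> full streamed response, expiring after ttl_s seconds."""
    def __init__(self, ttl_s: float = 300.0):
        self.ttl_s = ttl_s
        self._store = {}   # request_id -> (expires_at, response)

    def get(self, request_id: str):
        entry = self._store.get(request_id)
        if entry and entry[0] > time.monotonic():
            return entry[1]                # replay the cached response
        self._store.pop(request_id, None)  # expired or absent
        return None

    def put(self, request_id: str, response: str):
        self._store[request_id] = (time.monotonic() + self.ttl_s, response)

cache = ReplayCache()
cache.put("req-123", "data: ...\ndata: [DONE]\n")
print(cache.get("req-123") is not None)  # True
```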
Per 1M tokens generated, baseline 3 nodes (= 6 H100s):
h100_lease: $2.00/hr each
nodes_baseline: 3 (= 6 H100s = $12/hr)
hourly_cost: $12
sustained_throughput: 5,000 tok/sec = 18M tok/hr
cost_per_1M_tokens: $12 / 18 = $0.67
# add overhead
gateway_compute: ~$0.20 per 1M
storage_egress: ~$0.10
observability: ~$0.10
---
total_per_1M: ~$1.07
target: < $1.50 -- well under
# compare bedrock haiku at $0.25 in / $1.25 out:
# blended ~$0.50 per 1M tokens depending on input/output ratio
# verdict: self-hosted Llama 70B is ~2x cost of Haiku but
# delivers larger model + full data sovereignty
The economic case for self-hosted Llama 70B is not "cheaper per token" — it usually isn't, vs Haiku-class managed APIs. The case is: data sovereignty, predictable cost (no surprise bills on usage spikes), control over model upgrades, and access to the larger 70B class without paying Sonnet pricing.
vLLM vs TGI vs SGLang. vLLM is the safe production choice: biggest community, most mature scheduler, OpenAI-compatible out of the box. TGI matches it on most workloads with cleaner ops. SGLang is the throughput king for multi-turn (radix-tree prefix cache is genuinely faster) but production tooling is younger. I'd ship vLLM, evaluate SGLang on the next quarter's benchmark.
FP8 vs AWQ-INT4 vs FP16. FP8 wins on H100: hardware-accelerated matmul, < 1% quality loss, 30% throughput gain over FP16. AWQ-INT4 lets you fit 70B on a single H100 but loses 3–5% on long contexts — bad trade for our use case. FP16 is the reference quality but needs the same 2×H100 footprint without the throughput gain. FP8 is the right default.
H100 vs L40S vs A100 vs MI300X. H100 is the production-safe pick: best raw perf, hardware FP8, mature drivers. L40S is dramatically cheaper (~$1/hr) but throughput is ~40% of H100 and FP8 is software-emulated; viable for a 13B model, painful for 70B. A100 is fine if you can get it cheap, but no FP8 hardware support means you give up the 30% throughput. MI300X has more HBM and competitive specs but the ROCm-vLLM port is still maturing; I would not bet a production launch on it today.
Continuous batching vs static batching vs request-batching. Continuous batching is strictly better for streaming chat workloads — no head-of-line blocking, every decode step is fully utilized. Static batching is only sane for batch-job offline scoring. Don't ship anything else for an interactive service.
Self-hosted vs Bedrock/Together/Anyscale. Managed providers are cheaper to operate by a wide margin if you don't need data sovereignty. The honest answer for most companies is "use Bedrock for the easy 90%, run self-hosted for the 10% that has data residency or compliance constraints." Pretending self-hosted will save money on token cost alone is the most common mistake; the savings come from compliance, predictability, and avoiding provider lock-in — not the $/token rate.
Speculative decoding now or later. Speculative decoding adds operational complexity (a second model to load, more GPU memory, more ways to fail). It can roughly 2× TPOT throughput when the draft model is well-aligned. Ship it after the base system is stable for ~30 days; the gain is real but not worth the launch risk.
GPU utilization measures whether the SMs are busy, not whether the system is meeting SLO. A vLLM node can run at 80% GPU util while p95 TPOT is fine, or at 80% util while every request is queueing 30 s. Queue depth measures the actual user-facing problem: how many requests are waiting for a decode step. Scaling on queue depth means we add capacity when users are about to feel slowness, which is the only useful definition of "load."
Naive K/V cache allocates a contiguous block per request sized for the max sequence length. Most requests don't reach max length, so most of that block is wasted — classic internal fragmentation. PagedAttention treats the K/V cache as a pool of fixed-size "pages" (typically 16 tokens each) and allocates pages to requests on demand. Memory utilization goes from ~30% to ~95% for typical chat workloads, which roughly translates to 2× effective concurrency on the same GPU.
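The utilization gap can be made concrete with a toy batch (sequence lengths below are hypothetical; 160 KB/token and 16-token pages come from the sizing above):

```python
import math

def contiguous_alloc_tokens(seq_lens, max_len):
    # Naive: every request reserves max_len token slots up front
    return len(seq_lens) * max_len

def paged_alloc_tokens(seq_lens, page_tokens):
    # Paged: each request holds ceil(len / page_tokens) fixed-size pages
    return sum(math.ceil(l / page_tokens) for l in seq_lens) * page_tokens

seqs = [700, 1200, 450, 2048, 900]      # hypothetical chat batch, 8192 cap
used = sum(seqs)
naive = contiguous_alloc_tokens(seqs, 8192)
paged = paged_alloc_tokens(seqs, 16)
print(round(used / naive, 2), round(used / paged, 2))  # 0.13 0.99
```

Per-token bytes cancel out of the ratio, so utilization depends only on sequence lengths; 13% vs 99% here is an extreme but representative illustration of the ~30% vs ~95% figures above.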
Two layers. The gateway caps max_tokens per tenant tier; an
SMB tenant might be limited to 1024, an enterprise tenant to 4096. The vLLM
node enforces a global hard cap via max-model-len — even
if a request slips through asking for 16k, the engine rejects it rather than
silently overrunning the context. Long-output
requests still consume K/V cache for their full duration; they get scheduled
against the same batch as short requests, so the impact is bounded. We do
not pre-allocate based on max_tokens.
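The two-layer cap reduces to one clamp in the gateway (function and tier names are illustrative):

```python
def effective_max_tokens(requested: int, tier_cap: int,
                         prompt_len: int, model_len: int = 8192) -> int:
    """Clamp a request's max_tokens to the tenant tier cap and to what
    actually fits in the remaining context window."""
    return max(0, min(requested, tier_cap, model_len - prompt_len))

print(effective_max_tokens(16_000, 1024, 500))   # SMB tier: 1024
print(effective_max_tokens(16_000, 4096, 7000))  # enterprise, long prompt: 1192
```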
Run the new model on a separate node pool labeled model-version=v2;
the gateway routes a configurable percentage of requests there based on
tenant or sticky session. Telemetry is tagged with the version so dashboards
compare TTFT, TPOT, and quality metrics. Promote v2 by flipping the routing
percentage to 100% and decommissioning the v1 pool. Hot-swap inside a single
vLLM node is technically possible but risky — we prefer separate pools
because rollback is just a routing change.
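The sticky routing decision can be sketched as a deterministic hash of the tenant id (a minimal sketch; the real gateway would also consult per-tenant overrides):

```python
import hashlib

def route_version(tenant_id: str, canary_pct: int) -> str:
    """Hash the tenant id into [0, 100) and send that slice to the v2 pool.
    Deterministic, so a tenant never flaps between model versions."""
    bucket = int(hashlib.sha256(tenant_id.encode()).hexdigest(), 16) % 100
    return "model-version=v2" if bucket < canary_pct else "model-version=v1"

print(route_version("tenant-42", 10))  # stable result for this tenant
```

Promoting v2 is then literally `canary_pct = 100`, and rollback is `canary_pct = 0`.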
First, check whether spikes correlate with prefill chunks being scheduled
— long prompts entering the batch can briefly delay decode for everyone.
The fix is enable-chunked-prefill: true which we already have
on. Second, check K/V cache occupancy at the spike: if the cache is full,
new requests are waiting for slots, not for compute. Third, check whether
the spike correlates with cold starts (autoscaler added a node mid-burst);
the new node accepted traffic before model load completed. Each has a
distinct fingerprint in the metrics; the right tool is grouped traces tagged
by tenant, prompt length, and node ID.
(1) The client SDK sends POST /v1/chat/completions with stream=true to our LB.
(2) Envoy validates TLS, extracts the JWT, and routes to an API Gateway pod.
(3) The gateway authorizes the tenant, checks the per-tenant quota bucket, and checks cluster admission.
(4) The gateway picks a vLLM node by least-loaded (free K/V slots, not request count).
(5) The gateway opens a streaming connection to vLLM's OpenAI endpoint and starts forwarding the SSE stream back to the client.
(6) On the vLLM node, the request enters the scheduler. If a prefix-cache hit exists for the system prompt (likely), only the new tokens need prefill.
(7) Chunked prefill processes the input over a few decode steps.
(8) Once prefill finishes, the first decoded token is generated and immediately streamed to the gateway, which streams it to the client.
(9) End-to-end TTFT: ~300 ms for short prompts, ~600 ms for long prompts — at the 600 ms p95 target.
(10) Subsequent tokens stream at ~25 tok/sec until EOS, with TPOT under 40 ms.