Picking a model in 2026 is no longer a single decision — it is a routing decision. Most production systems use two or three models: a small fast one for classification and reformulation, a mid-tier one for the bulk of traffic, and a frontier model for the hard 5% that needs deep reasoning, long context, or careful agentic tool use. This page lays out the current landscape, side-by-side specs and prices, where each model wins, what hosting options exist, and how to wire a router that escalates and falls back without rewriting your application.
Numbers below are hedged ("starts at", "as of 2026") because list prices, context windows, and even model names drift quarterly. Verify against the provider's pricing page before signing a vendor contract.
One more framing point before the table: model choice is not a one-time decision. The frontier reshuffles every quarter (a new release here, a price cut there, an open-weight model crossing a quality threshold), and what was the obvious answer in January often is not the obvious answer by July. Treat your model selection as versioned, not static — the same way you treat your database engine version or your runtime version.
The 2026 frontier sits in three buckets. Frontier closed-weight models — Claude Opus 4.7, GPT-5, Gemini 2.x Pro — lead on multi-step reasoning, tool use, and long-context recall, served only via API or hyperscaler resale. Frontier open-weight models — Llama 3.3 70B, Mistral Large 2, Qwen 2.5 72B, DeepSeek V3 — have closed most of the quality gap on knowledge tasks and code, and you can self-host them on your own GPUs (or rent inference from Together, Fireworks, Groq, Bedrock). Smaller fast models — Claude Haiku 4.5, Gemini 2.x Flash, Llama 3.3 8B, GPT-5 mini — exist for high-throughput, latency-sensitive, or per-token cost-sensitive workloads where 90% quality at 5% cost is the right tradeoff.
The practical implication: a serious LLM application picks at least one model from bucket 1 or 2 for hard requests, plus one from bucket 3 for everything else. Routing between them is now an architectural concern, not a future optimization.
Two background trends frame everything below. First, the price-per-quality curve has dropped roughly 5x year-over-year since 2023; what cost $30 per million tokens for frontier output two years ago is now $3–$15. Second, open-weight models are within ~5–10 points of closed-weight on most benchmarks, which means the question "should we self-host?" now turns on operational and compliance economics rather than capability gaps.
Pricing is per million tokens, list price as of early 2026. Context and output windows are the published maximums; sustained throughput is usually lower.
| Model | Provider | Context | Output | In $/1M | Out $/1M | Strengths | Hosting | Cutoff |
|---|---|---|---|---|---|---|---|---|
| Claude Opus 4.7 | Anthropic | 1M | 64K | $15 | $75 | Long-doc reasoning, agentic tool use, code | API, Bedrock, Vertex | Jan 2026 |
| Claude Sonnet 4.6 | Anthropic | 1M | 64K | $3 | $15 | Best price/perf for production RAG and chat | API, Bedrock, Vertex | Late 2025 |
| Claude Haiku 4.5 | Anthropic | 200K | 16K | $0.80 | $4 | Low-latency classification, query rewrite | API, Bedrock, Vertex | Mid 2025 |
| GPT-5 | OpenAI | 400K | 128K | starts at $10 | starts at $40 | Code, math, structured output, vision | API, Azure | Late 2025 |
| GPT-4o (legacy) | OpenAI | 128K | 16K | $2.50 | $10 | Mature ecosystem, strong general purpose | API, Azure | Oct 2023 |
| Gemini 2.x Pro | Google | 2M | 64K | starts at $7 | starts at $21 | Largest context, native video, grounded search | Vertex, AI Studio | Late 2025 |
| Gemini 2.x Flash | Google | 1M | 64K | $0.30 | $2.50 | Cheap long-context, batch summarization | Vertex, AI Studio | Late 2025 |
| Llama 3.3 70B | Meta (open) | 128K | 8K | ~$0.70 hosted | ~$0.80 hosted | Open weights, fine-tunable, strong code | Self-host, Bedrock, Together, Fireworks, Groq | Dec 2023 |
| Llama 3.3 8B | Meta (open) | 128K | 8K | ~$0.10 hosted | ~$0.10 hosted | Edge, on-device, embedded routing | Self-host, Bedrock, Together, Groq | Dec 2023 |
| Mistral Large 2 | Mistral (open weights) | 128K | 16K | $2 | $6 | European data residency, strong multilingual | La Plateforme, Bedrock, Azure, self-host | Mid 2024 |
| Qwen 2.5 72B | Alibaba (open) | 128K | 8K | ~$0.90 hosted | ~$0.90 hosted | Strongest open Chinese, solid code, math | Self-host, DashScope, Together | Mid 2024 |
| DeepSeek V3 | DeepSeek (open) | 128K | 8K | $0.27 | $1.10 | MoE, very cheap, strong code, reasoning variant | API, self-host (heavy) | Mid 2024 |
Two things this table hides: cached input (Anthropic, Google, OpenAI all charge ~10% for cache hits, which dominates economics for long system prompts), and batch pricing (50% off on most providers if you can wait 24h). For high-volume offline workloads, batch + cache is often a 5–10x cost reduction over the headline number.
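The combined effect of the two discounts is easy to sanity-check. A back-of-envelope sketch, using the same approximations as the text (cache hits billed at ~10% of list price, batch at a flat 50% off):

```python
# Effective input-price multiplier under prompt caching and batch pricing.
# Assumes cache hits are billed at ~10% of the base input price and batch
# jobs get a flat 50% discount -- the approximations used in the text.

def effective_input_multiplier(cache_hit_rate: float, batch: bool) -> float:
    cached = 0.10 * cache_hit_rate         # cache hits at ~10% of list price
    uncached = 1.0 * (1 - cache_hit_rate)  # misses at full list price
    return (cached + uncached) * (0.5 if batch else 1.0)

# A stable system prompt with a 90% hit rate, run through the batch API:
m = effective_input_multiplier(cache_hit_rate=0.9, batch=True)
print(f"{m:.3f}x of list price")  # 0.095x of list price -> roughly 10x cheaper
```

At an 80% hit rate without batch the multiplier is 0.28x, which is where the "5-10x" range in the text comes from.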
A second view, useful for quick mental anchoring — the rough price ratio of each model relative to the cheapest in the table (Llama 3.3 8B):
| Model | Output cost vs Llama 8B |
|---|---|
| Llama 3.3 8B | 1x (baseline) |
| Llama 3.3 70B (hosted) | ~8x |
| DeepSeek V3 | ~11x |
| Gemini 2.x Flash | ~25x |
| Claude Haiku 4.5 | ~40x |
| Mistral Large 2 | ~60x |
| GPT-4o (legacy) | ~100x |
| Claude Sonnet 4.6 | ~150x |
| Gemini 2.x Pro | ~210x |
| GPT-5 | ~400x |
| Claude Opus 4.7 | ~750x |
This is the easiest way to argue for routing in a design review. Even if the smart model is "only" used on 5% of requests, that 5% can dominate cost if it's the most expensive tier — which is exactly why the escalation logic and the validator that triggers it deserve real engineering attention.
Capability rankings turn over faster than the spec table does, so treat any per-workload ranking as a snapshot of early 2026 and re-benchmark on your own tasks each quarter rather than trusting a leaderboard.
Quality is hard to compare on a single axis; cost and latency are not. Once you have established that two models meet your quality bar, the choice between them collapses to total cost-of-ownership and SLA fit. The worked example below uses a representative RAG workload — most production systems sit somewhere within an order of magnitude of these inputs.
Scenario: a RAG service answers 100,000 questions. Each request averages 4,000 input tokens (system prompt + retrieved context) and 500 output tokens. Total per run: 400M input tokens and 50M output tokens. Cost per model at list price, no caching, no batch discount:
| Model | Input cost | Output cost | Total | ~Latency p50 |
|---|---|---|---|---|
| Claude Opus 4.7 | $6,000 | $3,750 | $9,750 | 4–8s |
| Claude Sonnet 4.6 | $1,200 | $750 | $1,950 | 2–4s |
| Claude Haiku 4.5 | $320 | $200 | $520 | 0.5–1.5s |
| GPT-5 | $4,000 | $2,000 | $6,000 | 3–7s |
| Gemini 2.x Pro | $2,800 | $1,050 | $3,850 | 3–6s |
| Gemini 2.x Flash | $120 | $125 | $245 | 0.5–1.5s |
| Llama 3.3 70B (Together) | $280 | $40 | $320 | 1–3s |
| Llama 3.3 8B (Groq) | $40 | $5 | $45 | 0.1–0.4s |
| DeepSeek V3 | $108 | $55 | $163 | 2–5s |
The 200x gap between Opus and Llama 8B is precisely why routing matters. Most user questions can be answered by Sonnet or Flash; only a small fraction actually need Opus or GPT-5. Two cost levers crush the absolute number: prompt caching (the 4K system prompt gets cached, dropping input cost by ~85% on the second request) and batch (50% off if the answer can wait).
Latency numbers above are p50 for a non-streaming completion. Streaming changes the user-perceived number — time-to-first-token on Sonnet is around 400ms, on Haiku around 200ms, on Groq-hosted Llama 8B under 100ms. If your UX is conversational, optimize for TTFT and stream; if it is batch (summarize a folder of PDFs overnight), optimize for total tokens-per-second and use the batch API. These are different optimizations and they sometimes pull in opposite directions on model choice.
A simple cost calculator you can drop into a notebook to size a workload:
```python
# cost.py — back-of-envelope LLM workload pricing.
PRICING = {
    # ($ per 1M input tok, $ per 1M output tok)
    "claude-opus-4-7":   (15.00, 75.00),
    "claude-sonnet-4-6": ( 3.00, 15.00),
    "claude-haiku-4-5":  ( 0.80,  4.00),
    "gpt-5":             (10.00, 40.00),
    "gemini-2-pro":      ( 7.00, 21.00),
    "gemini-2-flash":    ( 0.30,  2.50),
    "llama-3-3-70b":     ( 0.70,  0.80),
    "llama-3-3-8b":      ( 0.10,  0.10),
    "deepseek-v3":       ( 0.27,  1.10),
}

def estimate(model, n_requests, in_tok, out_tok, cache_hit_rate=0.0, batch=False):
    p_in, p_out = PRICING[model]
    eff_in_price = p_in * (1 - 0.9 * cache_hit_rate)  # 90% off cached input
    cost = (n_requests * in_tok / 1e6) * eff_in_price \
         + (n_requests * out_tok / 1e6) * p_out
    return cost * (0.5 if batch else 1.0)

for m in PRICING:
    c = estimate(m, n_requests=100_000, in_tok=4000, out_tok=500,
                 cache_hit_rate=0.8, batch=False)
    print(f"{m:24s} ${c:8,.0f}")
```
Streaming-first applications should also track tokens-per-second not just total latency — a 5s response that streams smoothly feels different from a 5s wall of silence. Most providers expose streaming via the same SDK with a stream=True flag; the cost is identical to non-streaming.
```python
# stream.py — measure TTFT and tokens/sec.
import time
from anthropic import Anthropic

client = Anthropic()
t0 = time.time()
ttft = None
tokens = 0

with client.messages.stream(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    messages=[{"role": "user", "content": "Explain CRDTs in 200 words."}],
) as stream:
    for text in stream.text_stream:
        if ttft is None:
            ttft = time.time() - t0
        tokens += len(text.split())  # word count as a rough token proxy

total = time.time() - t0
print(f"TTFT: {ttft*1000:.0f}ms  total: {total:.2f}s  rate: {tokens/total:.1f} tok/s")
```
An operational warning on cost modeling: reasoning-mode models (Anthropic extended_thinking, OpenAI o-series, Gemini Flash Thinking) emit hidden reasoning tokens that count against your output bill but are not shown in the response. A "500 token answer" can have 5,000 reasoning tokens behind it. Always log usage.output_tokens from the API response, not the length of the visible string.

Where you can actually run each model matters as much as the model itself. Procurement, data residency, and existing cloud commits often decide before quality does.
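The billed-vs-visible gap is cheap to log at the call site. A sketch of a pure helper; the ~4 characters-per-token estimate and the field name `output_tokens` (per Anthropic's Messages API usage object; other SDKs expose the same number under slightly different names) are the assumptions here:

```python
# Compare billed output tokens against the visible answer length. The
# output_tokens number comes from the API's usage object; the visible
# estimate uses a crude ~4 chars/token heuristic.

def billed_vs_visible(visible_text: str, output_tokens: int) -> dict:
    visible_est = max(1, len(visible_text) // 4)
    return {
        "visible_tokens_est": visible_est,
        "billed_output_tokens": output_tokens,
        "hidden_ratio": output_tokens / visible_est,
    }

stats = billed_vs_visible("A short answer. " * 10, output_tokens=5_000)
print(stats)  # hidden_ratio >> 1 means reasoning tokens dominate the bill
```

Alert when `hidden_ratio` drifts upward across a deploy; a prompt change that triggers much longer hidden reasoning looks identical in the UI but not on the invoice.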
| Model | Native API | AWS Bedrock | Azure | Vertex AI | Self-host (vLLM / TGI) |
|---|---|---|---|---|---|
| Claude (all) | Yes | Yes | No | Yes | No |
| GPT-5 / GPT-4o | Yes | No | Yes (Azure OpenAI) | No | No |
| Gemini 2.x | AI Studio | No | No | Yes | No |
| Llama 3.3 | Together / Fireworks / Groq | Yes | Yes (MaaS) | Yes (MaaS) | Yes |
| Mistral Large 2 | La Plateforme | Yes | Yes (MaaS) | Yes (MaaS) | Yes (open weights) |
| Qwen 2.5 | DashScope | No | No | No | Yes |
| DeepSeek V3 | Yes | No | No | No | Yes (heavy: 671B params) |
Practical notes: Bedrock is the only place you get Claude, Llama, and Mistral on the same control plane (useful for compliance teams that want one audit surface). Azure OpenAI is the only enterprise route to GPT-5 and the path of least resistance if the rest of your stack is on Azure. Vertex AI is the only managed Gemini Pro option. For self-hosted Llama or Mistral, vLLM is the de-facto serving layer; for very high QPS on small models, Groq's LPU hardware is a category of its own on latency.
Hosting choice also carries second-order effects that show up months later; two worth calling out:
A note on procurement: every hyperscaler offers provisioned throughput options (Bedrock Provisioned, Azure PTU, Vertex committed capacity) where you pre-purchase model capacity at a discount in exchange for a multi-month commitment. These look attractive in spreadsheets but are usually a trap unless you have firm baseline traffic — unused PTUs do not refund. Most teams should run on-demand for the first 3–6 months, get a real load profile, then commit only to the steady-state floor.
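Whether a commitment pays off reduces to a one-line break-even check. A sketch with illustrative placeholder numbers (the hourly unit price, its rated throughput, and the on-demand token price are not real quotes; substitute the figures from your own proposal):

```python
# Break-even utilization for provisioned throughput vs. on-demand pricing.
# All three inputs below are illustrative placeholders, not real quotes.

def breakeven_utilization(ptu_usd_per_hour: float,
                          ptu_tokens_per_hour: float,
                          on_demand_usd_per_1m: float) -> float:
    """Fraction of provisioned capacity you must sustain before the
    commitment is cheaper than buying the same tokens on-demand."""
    on_demand_cost_at_full = (ptu_tokens_per_hour / 1e6) * on_demand_usd_per_1m
    return ptu_usd_per_hour / on_demand_cost_at_full

# Example: $40/hr for a unit rated at 10M tokens/hr, vs $10/1M on-demand.
u = breakeven_utilization(40.0, 10e6, 10.0)
print(f"break-even at {u:.0%} sustained utilization")  # break-even at 40% sustained utilization
```

If your measured load profile sits below that fraction for most hours of the day, on-demand wins despite the headline discount.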
If you are evaluating Bedrock vs native API for Claude specifically, three subtle differences matter: (1) Bedrock applies its own region-specific quota that is separately negotiable from Anthropic's; (2) Bedrock supports cross-region inference profiles that auto-fail-over between regions for higher availability at no extra cost; (3) some Anthropic features (like extended thinking on the latest model) appear on the native API a few weeks before Bedrock. None of these are deal-breakers — just things to verify against your specific workload.
Two patterns cover most production needs: availability fallback (the same request retried on a second provider when the first errors or rate-limits) and quality escalation (a cheap model's answer validated, with failures re-run on a stronger model).
LiteLLM is the simplest router. It exposes an OpenAI-compatible endpoint, fronts ~100 providers, and supports both fallback and per-key budget tracking out of the box.
```yaml
# litellm-config.yaml — router config with escalation + fallback.
model_list:
  - model_name: cheap
    litellm_params:
      model: groq/llama-3.3-8b
      api_key: os.environ/GROQ_API_KEY
  - model_name: mid
    litellm_params:
      model: anthropic/claude-sonnet-4-6
      api_key: os.environ/ANTHROPIC_API_KEY
  - model_name: smart
    litellm_params:
      model: anthropic/claude-opus-4-7
      api_key: os.environ/ANTHROPIC_API_KEY
  - model_name: smart-backup
    litellm_params:
      model: openai/gpt-5
      api_key: os.environ/OPENAI_API_KEY

router_settings:
  fallbacks:
    - smart: ["smart-backup"]  # availability fallback
    - mid: ["smart"]           # quality escalation handled in app code
  num_retries: 2
  timeout: 30
  routing_strategy: simple-shuffle
  redis_host: localhost        # for cross-process rate limit + budget state
  cache_responses: true
```
Application code can then call any provider through one OpenAI-shaped endpoint:
```python
# router.py — escalate from mid to smart on validator failure.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:4000", api_key="anything")

def answer(question: str) -> str:
    for tier in ("mid", "smart"):
        resp = client.chat.completions.create(
            model=tier,
            messages=[
                {"role": "system", "content": "Answer concisely. End with CONFIDENCE: HIGH or LOW."},
                {"role": "user", "content": question},
            ],
        )
        text = resp.choices[0].message.content
        if "CONFIDENCE: HIGH" in text:
            return text.replace("CONFIDENCE: HIGH", "").strip()
    # both tiers reported LOW: return the smart tier's answer anyway
    return text.replace("CONFIDENCE: LOW", "").strip()
```
Self-reported confidence is a weak signal on its own — pair it with a structural validator (was a citation produced? did the SQL parse? did the JSON validate?) for anything that matters.
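A structural validator can be a dozen lines. The invariants below (output parses as JSON, carries a non-empty "sources" list, and contains at least one [n]-style inline citation) are examples; substitute whatever contract your output format actually has:

```python
import json
import re

# Example structural checks for an answer that must be JSON with a non-empty
# "sources" list and at least one [n]-style citation in its "text" field.
# Swap these invariants for your own output contract.

def passes_structural_checks(raw: str) -> bool:
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError:
        return False
    if not isinstance(obj.get("sources"), list) or not obj["sources"]:
        return False
    # at least one inline citation like [1] in the answer text
    return bool(re.search(r"\[\d+\]", obj.get("text", "")))

good = '{"text": "CRDTs converge [1].", "sources": ["doc-17"]}'
bad = '{"text": "CRDTs converge.", "sources": []}'
print(passes_structural_checks(good), passes_structural_checks(bad))  # True False
```

Wire this in place of (or alongside) the CONFIDENCE check: a structural failure escalates to the smart tier regardless of what the model claims about its own confidence.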
For agentic systems specifically, a stronger escalation signal is tool-call disagreement: if two cheap models, given the same conversation history, choose different next tool calls, that is a high-information disagreement and worth a smart-model arbitration. Cheap on agreement (the common case), expensive only on actual ambiguity.
```python
# escalate_on_tool_disagreement.py
# call_cheap / call_smart are your own provider wrappers (not shown).

def next_action(state):
    a = call_cheap("haiku-4-5", state)    # returns {"tool": ..., "args": ...}
    b = call_cheap("gemini-flash", state)
    if a["tool"] == b["tool"] and a["args"] == b["args"]:
        return a                          # cheap path, ~95% of the time
    return call_smart("opus-4-7", state)  # arbitration on disagreement
```
One last consideration on routing: budget caps per request, per user, per tenant. The router is the right place to enforce them because the application doesn't always know the cost in advance. LiteLLM supports a max_budget on each virtual key; pair it with per-tenant Redis counters so a runaway loop on one customer cannot drain the shared budget for the rest.
```python
# budget_guard.py — refuse the call if the tenant is at cap.
import time

import redis

r = redis.Redis()

def check_and_charge(tenant_id: str, est_cost_usd: float, daily_cap: float) -> bool:
    key = f"budget:{tenant_id}:{time.strftime('%Y-%m-%d')}"
    spent = float(r.get(key) or 0)
    if spent + est_cost_usd > daily_cap:
        return False
    r.incrbyfloat(key, est_cost_usd)
    r.expire(key, 86_400 * 2)  # keep one day of slack for late reconciliation
    return True
```
Estimate cost with the cheaper-model price before the call (input tokens are knowable; output you cap at max_tokens); reconcile with actual usage after the response arrives. The pre-check is what prevents a single request from putting you 10x over budget — even an estimate that is off by 30% keeps cost incidents bounded.
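The reconciliation step can apply just the delta between estimate and actual, so concurrent charges on the same counter are preserved. A sketch with a plain dict standing in for the Redis counter from budget_guard.py (with Redis, the adjustment is a single incrbyfloat on the same key):

```python
# Post-response reconciliation of an estimated charge against actual usage.
# A dict stands in for the per-tenant Redis counter; the key layout mirrors
# budget_guard.py.

ledger: dict[str, float] = {}

def charge_estimate(key: str, est_cost_usd: float) -> None:
    # pre-call: reserve the capped worst-case estimate
    ledger[key] = ledger.get(key, 0.0) + est_cost_usd

def reconcile(key: str, est_cost_usd: float, actual_cost_usd: float) -> None:
    # post-call: apply only the delta so concurrent charges are preserved
    ledger[key] = ledger.get(key, 0.0) + (actual_cost_usd - est_cost_usd)

charge_estimate("budget:acme:2026-02-01", 0.08)  # estimate at max_tokens
reconcile("budget:acme:2026-02-01", 0.08, 0.05)  # actual came in lower
print(ledger)  # the counter now holds the actual (~$0.05) spend
```

Because the estimate is charged before the call and the delta after, the counter never transiently undercounts, which is the property the cap check relies on.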
One subtler routing pattern worth knowing: cascade with majority vote. Send a request to two cheap models in parallel; if they agree, return immediately; if they disagree, escalate to the smart model as tiebreaker. On classification-like tasks with a small label space this often beats a single mid-tier call on both quality and cost. The math: you pay 2x cheap on every request, plus 1x smart on the disagreement rate. If cheap-model agreement is 85% and the smart model is 25x the price of cheap, your effective cost is 2 + 0.15 * 25 = 5.75x cheap, well under a single mid-tier call at ~10x cheap. Worth modeling on your own data before adopting.
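The cascade arithmetic above, as a function you can run over your own measured agreement rate before adopting the pattern:

```python
# Effective cost of a two-cheap-models + smart-tiebreaker cascade, in units
# of one cheap call. Matches the worked numbers in the text.

def cascade_cost(agreement_rate: float, smart_to_cheap_ratio: float) -> float:
    # always pay for two cheap calls; pay the smart model on disagreement
    return 2.0 + (1.0 - agreement_rate) * smart_to_cheap_ratio

print(round(cascade_cost(0.85, 25), 2))  # 5.75 -- the example from the text
print(round(cascade_cost(0.60, 25), 2))  # 12.0 -- low agreement erodes the win
```

The crossover is easy to read off: the cascade beats a mid-tier model priced at 10x cheap whenever the disagreement rate stays below (10 - 2) / 25 = 32%.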
Also worth modeling: caching as a routing signal. If your system prompt is large and stable, the cache hit rate determines whether escalation is even worth it — escalating to a fresh model means re-paying for cache misses on the smart side. Some routers (LiteLLM included) support cache-aware fallback: prefer the model that already has the cache warm unless quality demands otherwise.
The decision is rarely "which model is best?" — it is "which model satisfies my hardest constraint, and which fallback covers the next-hardest?" Constraints come in this rough priority order in most enterprises: data residency > latency SLA > per-request cost ceiling > quality on hard requests > ecosystem integration. Lock the top constraint first.
A decision tree that fits on one screen, walking the constraints in that priority order:

1. Data cannot leave your network? Self-host Llama 3.3 or Mistral Large 2 on vLLM; stop here.
2. Hard sub-second latency SLA? Haiku, Gemini Flash, or Groq-hosted Llama 8B.
3. Per-request cost ceiling dominates? Gemini Flash or DeepSeek V3, with caching and batch.
4. Hard reasoning, long documents, or deep agent loops? Opus 4.7 or GPT-5, behind validator-gated escalation.
5. Otherwise: Sonnet 4.6 (or hosted Llama 3.3 70B) as the default tier, plus one escalation model and one cross-provider backup.
Three patterns to avoid as starting points: (a) defaulting to GPT-4o because of "ecosystem familiarity" — the model is a generation behind on reasoning at roughly Sonnet's price for lower quality; (b) starting on Opus because "best is best" — you'll burn cash for months before you realize Sonnet covered the same workload at 1/5 the cost; (c) starting on a self-hosted Llama before you have measured demand — GPU sunk cost is hard to walk back, and a managed API gives you 6 months of free elasticity.
Whatever you pick, build the routing seam from day one. Hardcoding a single model name in your application is the most expensive shortcut in LLM engineering — re-pricing or re-availability happens on the provider's schedule, not yours.
One last piece of pragmatism: the same code path should work across at least two providers from launch. The OpenAI-compatible chat completions schema is the de-facto lingua franca; LiteLLM, Together, Fireworks, Groq, vLLM, and Anthropic (via a thin adapter) all speak it. Stay on this shape and provider migration becomes a config change, not a refactor.
```python
# providers.py — one client interface, three providers, one swap.
import os

from openai import OpenAI

ANTHROPIC_KEY = os.environ["ANTHROPIC_API_KEY"]
OPENAI_KEY = os.environ["OPENAI_API_KEY"]
TOGETHER_KEY = os.environ["TOGETHER_API_KEY"]

PROVIDERS = {
    "anthropic": OpenAI(base_url="https://api.anthropic.com/v1/", api_key=ANTHROPIC_KEY),
    "openai": OpenAI(api_key=OPENAI_KEY),
    "together": OpenAI(base_url="https://api.together.xyz/v1", api_key=TOGETHER_KEY),
}

def chat(provider: str, model: str, messages: list[dict]) -> str:
    resp = PROVIDERS[provider].chat.completions.create(
        model=model, messages=messages, max_tokens=1024,
    )
    return resp.choices[0].message.content

# Same call shape, three different backends:
chat("anthropic", "claude-sonnet-4-6", msgs)
chat("openai", "gpt-5", msgs)
chat("together", "meta-llama/Llama-3.3-70B-Instruct-Turbo", msgs)
```
**Sonnet or Opus?** Sonnet for the bulk of RAG and chat traffic. It is roughly 1/5 the price and 2x faster, with quality close enough to Opus on summarization, classification, and short-loop tool use. Reserve Opus for the requests that fail a Sonnet validator — long-document analysis, agent loops beyond ~5 steps, or complex code edits across multiple files.
**First lever for cutting cost?** Prompt caching plus batch. Caching a 4K system prompt drops repeated-input cost by ~85%; the batch API takes another 50% off everything if you can wait up to 24 hours. Together those routinely produce 5–10x reductions on workloads with stable system prompts and no real-time SLA. Only after that should you consider routing to a smaller model.
**When does self-hosting make sense?** Three reasons: data cannot leave your network (regulatory or contractual), unit economics at very high QPS where amortized GPU cost beats per-token pricing, or the need to fine-tune on proprietary data. None of these is "Llama is better" — they are operational constraints that closed APIs cannot satisfy.
**How do you get provider redundancy?** Put LiteLLM (or an equivalent router) in front of two providers — for example Claude Sonnet 4.6 on Anthropic primary, GPT-5 on Azure backup. Configure automatic retry on 429 and 5xx with the second provider. Keep prompts and tool schemas neutral enough to work on both; track per-provider success rate and cost so you can shift the default if one degrades.
**What about bulk summarization of short PDFs?** Gemini 2.x Flash or DeepSeek V3 with the batch API. The job is offline (no latency SLA), per-document cost dominates, and 1-page PDFs do not need frontier reasoning. Either model at batch pricing comes in well under $50 for the run; Flash gets you native PDF/image input out of the box.
Published max context is the input limit, not the recall guarantee. Models degrade differently as context fills — needle-in-a-haystack accuracy at 800K is much lower than at 80K for most models, and hallucination on multi-document synthesis rises with input size. Always benchmark on your own retrieval-augmented documents at the size you actually expect to send, not at the headline number.