Picking a model in 2026 is no longer a single decision — it is a routing decision. Most production systems use two or three models: a small fast one for classification and reformulation, a mid-tier one for the bulk of traffic, and a frontier model for the hard 5% that needs deep reasoning, long context, or careful agentic tool use. This page lays out the current landscape, side-by-side specs and prices, where each model wins, what hosting options exist, and how to wire a router that escalates and falls back without rewriting your application.
Numbers below are hedged ("starts at", "as of 2026") because list prices, context windows, and even model names drift quarterly. Verify against the provider's pricing page before signing a vendor contract.
One more framing point before the table: model choice is not a one-time decision. The frontier reshuffles every quarter (a new release here, a price cut there, an open-weight model crossing a quality threshold), and what was the obvious answer in January often is not the obvious answer by July. Treat your model selection as versioned, not static — the same way you treat your database engine version or your runtime version.
The 2026 frontier sits in three buckets. Frontier closed-weight models — Claude Opus 4.7, GPT-5, Gemini 2.x Pro — lead on multi-step reasoning, tool use, and long-context recall, served only via API or hyperscaler resale. Frontier open-weight models — Llama 3.3 70B, Mistral Large 2, Qwen 2.5 72B, DeepSeek V3 — have closed most of the quality gap on knowledge tasks and code, and you can self-host them on your own GPUs (or rent inference from Together, Fireworks, Groq, Bedrock). Smaller fast models — Claude Haiku 4.5, Gemini 2.x Flash, Llama 3.3 8B, GPT-5 mini — exist for high-throughput, latency-sensitive, or per-token cost-sensitive workloads where 90% quality at 5% cost is the right tradeoff.
The practical implication: a serious LLM application picks at least one model from bucket 1 or 2 for hard requests, plus one from bucket 3 for everything else. Routing between them is now an architectural concern, not a future optimization.
Two background trends frame everything below. First, the price-per-quality curve has dropped roughly 5x year-over-year since 2023; what cost $30 per million tokens for frontier output two years ago is now $3–$15. Second, open-weight models are within ~5–10 points of closed-weight on most benchmarks, which means the question "should we self-host?" now turns on operational and compliance economics rather than capability gaps.
Pricing is per million tokens, list price as of early 2026. Context and output windows are the published maximums; sustained throughput is usually lower.
| Model | Provider | Context | Output | In $/1M | Out $/1M | Strengths | Hosting | Cutoff |
|---|---|---|---|---|---|---|---|---|
| Claude Opus 4.7 | Anthropic | 1M | 64K | $15 | $75 | Long-doc reasoning, agentic tool use, code | API, Bedrock, Vertex | Jan 2026 |
| Claude Sonnet 4.6 | Anthropic | 1M | 64K | $3 | $15 | Best price/perf for production RAG and chat | API, Bedrock, Vertex | Late 2025 |
| Claude Haiku 4.5 | Anthropic | 200K | 16K | $0.80 | $4 | Low-latency classification, query rewrite | API, Bedrock, Vertex | Mid 2025 |
| GPT-5 | OpenAI | 400K | 128K | starts at $10 | starts at $40 | Code, math, structured output, vision | API, Azure | Late 2025 |
| GPT-4o (legacy) | OpenAI | 128K | 16K | $2.50 | $10 | Mature ecosystem, strong general purpose | API, Azure | Oct 2023 |
| Gemini 2.x Pro | Google | 2M | 64K | starts at $7 | starts at $21 | Largest context, native video, grounded search | Vertex, AI Studio | Late 2025 |
| Gemini 2.x Flash | Google | 1M | 64K | $0.30 | $2.50 | Cheap long-context, batch summarization | Vertex, AI Studio | Late 2025 |
| Llama 3.3 70B | Meta (open) | 128K | 8K | ~$0.70 hosted | ~$0.80 hosted | Open weights, fine-tunable, strong code | Self-host, Bedrock, Together, Fireworks, Groq | Dec 2023 |
| Llama 3.3 8B | Meta (open) | 128K | 8K | ~$0.10 hosted | ~$0.10 hosted | Edge, on-device, embedded routing | Self-host, Bedrock, Together, Groq | Dec 2023 |
| Mistral Large 2 | Mistral (open weights) | 128K | 16K | $2 | $6 | European data residency, strong multilingual | La Plateforme, Bedrock, Azure, self-host | Mid 2024 |
| Qwen 2.5 72B | Alibaba (open) | 128K | 8K | ~$0.90 hosted | ~$0.90 hosted | Strongest open Chinese, solid code, math | Self-host, DashScope, Together | Mid 2024 |
| DeepSeek V3 | DeepSeek (open) | 128K | 8K | $0.27 | $1.10 | MoE, very cheap, strong code, reasoning variant | API, self-host (heavy) | Mid 2024 |
Two things this table hides: cached input (Anthropic, Google, OpenAI all charge ~10% for cache hits, which dominates economics for long system prompts), and batch pricing (50% off on most providers if you can wait 24h). For high-volume offline workloads, batch + cache is often a 5–10x cost reduction over the headline number.
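The combined effect of the two discounts is easy to sanity-check. A back-of-envelope sketch, using the same approximations as the text (cache hits billed at ~10% of list price, batch at a flat 50% off):

```python
# Effective input-price multiplier under prompt caching and batch pricing.
# Assumes cache hits are billed at ~10% of the base input price and batch
# jobs get a flat 50% discount -- the approximations used in the text.

def effective_input_multiplier(cache_hit_rate: float, batch: bool) -> float:
    cached = 0.10 * cache_hit_rate         # cache hits at ~10% of list price
    uncached = 1.0 * (1 - cache_hit_rate)  # misses at full list price
    return (cached + uncached) * (0.5 if batch else 1.0)

# A stable system prompt with a 90% hit rate, run through the batch API:
m = effective_input_multiplier(cache_hit_rate=0.9, batch=True)
print(f"{m:.3f}x of list price")  # 0.095x of list price -> roughly 10x cheaper
```

At an 80% hit rate without batch the multiplier is 0.28x, which is where the "5-10x" range in the text comes from.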
A second view, useful for quick mental anchoring — the rough price ratio of each model relative to the cheapest in the table (Llama 3.3 8B):
| Model | Output cost vs Llama 8B |
|---|---|
| Llama 3.3 8B | 1x (baseline) |
| Llama 3.3 70B (hosted) | ~8x |
| DeepSeek V3 | ~11x |
| Gemini 2.x Flash | ~25x |
| Claude Haiku 4.5 | ~40x |
| Mistral Large 2 | ~60x |
| GPT-4o (legacy) | ~100x |
| Claude Sonnet 4.6 | ~150x |
| Gemini 2.x Pro | ~210x |
| GPT-5 | ~400x |
| Claude Opus 4.7 | ~750x |
This is the easiest way to argue for routing in a design review. Even if the smart model is "only" used on 5% of requests, that 5% can dominate cost if it's the most expensive tier — which is exactly why the escalation logic and the validator that triggers it deserve real engineering attention.
Capability rankings turn over faster than the spec table does, so treat any per-workload ranking as a snapshot of early 2026 and re-benchmark on your own tasks each quarter rather than trusting a leaderboard.
Quality is hard to compare on a single axis; cost and latency are not. Once you have established that two models meet your quality bar, the choice between them collapses to total cost-of-ownership and SLA fit. The worked example below uses a representative RAG workload — most production systems sit somewhere within an order of magnitude of these inputs.
Scenario: a RAG service answers 100,000 questions. Each request averages 4,000 input tokens (system prompt + retrieved context) and 500 output tokens. Total per run: 400M input tokens and 50M output tokens. Cost per model at list price, no caching, no batch discount:
| Model | Input cost | Output cost | Total | ~Latency p50 |
|---|---|---|---|---|
| Claude Opus 4.7 | $6,000 | $3,750 | $9,750 | 4–8s |
| Claude Sonnet 4.6 | $1,200 | $750 | $1,950 | 2–4s |
| Claude Haiku 4.5 | $320 | $200 | $520 | 0.5–1.5s |
| GPT-5 | $4,000 | $2,000 | $6,000 | 3–7s |
| Gemini 2.x Pro | $2,800 | $1,050 | $3,850 | 3–6s |
| Gemini 2.x Flash | $120 | $125 | $245 | 0.5–1.5s |
| Llama 3.3 70B (Together) | $280 | $40 | $320 | 1–3s |
| Llama 3.3 8B (Groq) | $40 | $5 | $45 | 0.1–0.4s |
| DeepSeek V3 | $108 | $55 | $163 | 2–5s |
The 200x gap between Opus and Llama 8B is precisely why routing matters. Most user questions can be answered by Sonnet or Flash; only a small fraction actually need Opus or GPT-5. Two cost levers crush the absolute number: prompt caching (the 4K system prompt gets cached, dropping input cost by ~85% on the second request) and batch (50% off if the answer can wait).
Latency numbers above are p50 for a non-streaming completion. Streaming changes the user-perceived number — time-to-first-token on Sonnet is around 400ms, on Haiku around 200ms, on Groq-hosted Llama 8B under 100ms. If your UX is conversational, optimize for TTFT and stream; if it is batch (summarize a folder of PDFs overnight), optimize for total tokens-per-second and use the batch API. These are different optimizations and they sometimes pull in opposite directions on model choice.
A simple cost calculator you can drop into a notebook to size a workload:
```python
# cost.py — back-of-envelope LLM workload pricing.
PRICING = {
    # ($ per 1M input tok, $ per 1M output tok)
    "claude-opus-4-7":   (15.00, 75.00),
    "claude-sonnet-4-6": ( 3.00, 15.00),
    "claude-haiku-4-5":  ( 0.80,  4.00),
    "gpt-5":             (10.00, 40.00),
    "gemini-2-pro":      ( 7.00, 21.00),
    "gemini-2-flash":    ( 0.30,  2.50),
    "llama-3-3-70b":     ( 0.70,  0.80),
    "llama-3-3-8b":      ( 0.10,  0.10),
    "deepseek-v3":       ( 0.27,  1.10),
}

def estimate(model, n_requests, in_tok, out_tok, cache_hit_rate=0.0, batch=False):
    p_in, p_out = PRICING[model]
    eff_in_price = p_in * (1 - 0.9 * cache_hit_rate)  # 90% off cached input
    cost = (n_requests * in_tok / 1e6) * eff_in_price \
         + (n_requests * out_tok / 1e6) * p_out
    return cost * (0.5 if batch else 1.0)

for m in PRICING:
    c = estimate(m, n_requests=100_000, in_tok=4000, out_tok=500,
                 cache_hit_rate=0.8, batch=False)
    print(f"{m:24s} ${c:8,.0f}")
```
Streaming-first applications should also track tokens-per-second not just total latency — a 5s response that streams smoothly feels different from a 5s wall of silence. Most providers expose streaming via the same SDK with a stream=True flag; the cost is identical to non-streaming.
```python
# stream.py — measure TTFT and tokens/sec.
import time
from anthropic import Anthropic

client = Anthropic()
t0 = time.time()
ttft = None
tokens = 0

with client.messages.stream(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    messages=[{"role": "user", "content": "Explain CRDTs in 200 words."}],
) as stream:
    for text in stream.text_stream:
        if ttft is None:
            ttft = time.time() - t0
        tokens += len(text.split())  # word count as a rough token proxy

total = time.time() - t0
print(f"TTFT: {ttft*1000:.0f}ms  total: {total:.2f}s  rate: {tokens/total:.1f} tok/s")
```
An operational warning on cost modeling: reasoning-mode models (Anthropic extended_thinking, OpenAI o-series, Gemini Flash Thinking) emit hidden reasoning tokens that count against your output bill but are not shown in the response. A "500 token answer" can have 5,000 reasoning tokens behind it. Always log usage.output_tokens from the API response, not the length of the visible string.

Where you can actually run each model matters as much as the model itself. Procurement, data residency, and existing cloud commits often decide before quality does.
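The billed-vs-visible gap is cheap to log at the call site. A sketch of a pure helper; the ~4 characters-per-token estimate and the field name `output_tokens` (per Anthropic's Messages API usage object; other SDKs expose the same number under slightly different names) are the assumptions here:

```python
# Compare billed output tokens against the visible answer length. The
# output_tokens number comes from the API's usage object; the visible
# estimate uses a crude ~4 chars/token heuristic.

def billed_vs_visible(visible_text: str, output_tokens: int) -> dict:
    visible_est = max(1, len(visible_text) // 4)
    return {
        "visible_tokens_est": visible_est,
        "billed_output_tokens": output_tokens,
        "hidden_ratio": output_tokens / visible_est,
    }

stats = billed_vs_visible("A short answer. " * 10, output_tokens=5_000)
print(stats)  # hidden_ratio >> 1 means reasoning tokens dominate the bill
```

Alert when `hidden_ratio` drifts upward across a deploy; a prompt change that triggers much longer hidden reasoning looks identical in the UI but not on the invoice.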
| Model | Native API | AWS Bedrock | Azure | Vertex AI | Self-host (vLLM / TGI) |
|---|---|---|---|---|---|
| Claude (all) | Yes | Yes | No | Yes | No |
| GPT-5 / GPT-4o | Yes | No | Yes (Azure OpenAI) | No | No |
| Gemini 2.x | AI Studio | No | No | Yes | No |
| Llama 3.3 | Together / Fireworks / Groq | Yes | Yes (MaaS) | Yes (MaaS) | Yes |
| Mistral Large 2 | La Plateforme | Yes | Yes (MaaS) | Yes (MaaS) | Yes (open weights) |
| Qwen 2.5 | DashScope | No | No | No | Yes |
| DeepSeek V3 | Yes | No | No | No | Yes (heavy: 671B params) |
Practical notes: Bedrock is the only place you get Claude, Llama, and Mistral on the same control plane (useful for compliance teams that want one audit surface). Azure OpenAI is the only enterprise route to GPT-5 and the path of least resistance if the rest of your stack is on Azure. Vertex AI is the only managed Gemini Pro option. For self-hosted Llama or Mistral, vLLM is the de-facto serving layer; for very high QPS on small models, Groq's LPU hardware is a category of its own on latency.
Hosting choice also carries second-order effects that show up months later; two worth calling out:
A note on procurement: every hyperscaler offers provisioned throughput options (Bedrock Provisioned, Azure PTU, Vertex committed capacity) where you pre-purchase model capacity at a discount in exchange for a multi-month commitment. These look attractive in spreadsheets but are usually a trap unless you have firm baseline traffic — unused PTUs do not refund. Most teams should run on-demand for the first 3–6 months, get a real load profile, then commit only to the steady-state floor.
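Whether a commitment pays off reduces to a one-line break-even check. A sketch with illustrative placeholder numbers (the hourly unit price, its rated throughput, and the on-demand token price are not real quotes; substitute the figures from your own proposal):

```python
# Break-even utilization for provisioned throughput vs. on-demand pricing.
# All three inputs below are illustrative placeholders, not real quotes.

def breakeven_utilization(ptu_usd_per_hour: float,
                          ptu_tokens_per_hour: float,
                          on_demand_usd_per_1m: float) -> float:
    """Fraction of provisioned capacity you must sustain before the
    commitment is cheaper than buying the same tokens on-demand."""
    on_demand_cost_at_full = (ptu_tokens_per_hour / 1e6) * on_demand_usd_per_1m
    return ptu_usd_per_hour / on_demand_cost_at_full

# Example: $40/hr for a unit rated at 10M tokens/hr, vs $10/1M on-demand.
u = breakeven_utilization(40.0, 10e6, 10.0)
print(f"break-even at {u:.0%} sustained utilization")  # break-even at 40% sustained utilization
```

If your measured load profile sits below that fraction for most hours of the day, on-demand wins despite the headline discount.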
If you are evaluating Bedrock vs native API for Claude specifically, three subtle differences matter: (1) Bedrock applies its own region-specific quota that is separately negotiable from Anthropic's; (2) Bedrock supports cross-region inference profiles that auto-fail-over between regions for higher availability at no extra cost; (3) some Anthropic features (like extended thinking on the latest model) appear on the native API a few weeks before Bedrock. None of these are deal-breakers — just things to verify against your specific workload.
Two patterns cover most production needs: availability fallback (the same request retried on a second provider when the first errors or rate-limits) and quality escalation (a cheap model's answer validated, with failures re-run on a stronger model).
LiteLLM is the simplest router. It exposes an OpenAI-compatible endpoint, fronts ~100 providers, and supports both fallback and per-key budget tracking out of the box.
```yaml
# litellm-config.yaml — router config with escalation + fallback.
model_list:
  - model_name: cheap
    litellm_params:
      model: groq/llama-3.3-8b
      api_key: os.environ/GROQ_API_KEY
  - model_name: mid
    litellm_params:
      model: anthropic/claude-sonnet-4-6
      api_key: os.environ/ANTHROPIC_API_KEY
  - model_name: smart
    litellm_params:
      model: anthropic/claude-opus-4-7
      api_key: os.environ/ANTHROPIC_API_KEY
  - model_name: smart-backup
    litellm_params:
      model: openai/gpt-5
      api_key: os.environ/OPENAI_API_KEY

router_settings:
  fallbacks:
    - smart: ["smart-backup"]  # availability fallback
    - mid: ["smart"]           # quality escalation handled in app code
  num_retries: 2
  timeout: 30
  routing_strategy: simple-shuffle
  redis_host: localhost        # for cross-process rate limit + budget state
  cache_responses: true
```
Application code can then call any provider through one OpenAI-shaped endpoint:
```python
# router.py — escalate from mid to smart on validator failure.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:4000", api_key="anything")

def answer(question: str) -> str:
    for tier in ("mid", "smart"):
        resp = client.chat.completions.create(
            model=tier,
            messages=[
                {"role": "system", "content": "Answer concisely. End with CONFIDENCE: HIGH or LOW."},
                {"role": "user", "content": question},
            ],
        )
        text = resp.choices[0].message.content
        if "CONFIDENCE: HIGH" in text:
            return text.replace("CONFIDENCE: HIGH", "").strip()
    # both tiers reported LOW: return the smart tier's answer anyway
    return text.replace("CONFIDENCE: LOW", "").strip()
```
Self-reported confidence is a weak signal on its own — pair it with a structural validator (was a citation produced? did the SQL parse? did the JSON validate?) for anything that matters.
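A structural validator can be a dozen lines. The invariants below (output parses as JSON, carries a non-empty "sources" list, and contains at least one [n]-style inline citation) are examples; substitute whatever contract your output format actually has:

```python
import json
import re

# Example structural checks for an answer that must be JSON with a non-empty
# "sources" list and at least one [n]-style citation in its "text" field.
# Swap these invariants for your own output contract.

def passes_structural_checks(raw: str) -> bool:
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError:
        return False
    if not isinstance(obj.get("sources"), list) or not obj["sources"]:
        return False
    # at least one inline citation like [1] in the answer text
    return bool(re.search(r"\[\d+\]", obj.get("text", "")))

good = '{"text": "CRDTs converge [1].", "sources": ["doc-17"]}'
bad = '{"text": "CRDTs converge.", "sources": []}'
print(passes_structural_checks(good), passes_structural_checks(bad))  # True False
```

Wire this in place of (or alongside) the CONFIDENCE check: a structural failure escalates to the smart tier regardless of what the model claims about its own confidence.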
For agentic systems specifically, a stronger escalation signal is tool-call disagreement: if two cheap models, given the same conversation history, choose different next tool calls, that is a high-information disagreement and worth a smart-model arbitration. Cheap on agreement (the common case), expensive only on actual ambiguity.
```python
# escalate_on_tool_disagreement.py
# call_cheap / call_smart are your own provider wrappers (not shown).

def next_action(state):
    a = call_cheap("haiku-4-5", state)    # returns {"tool": ..., "args": ...}
    b = call_cheap("gemini-flash", state)
    if a["tool"] == b["tool"] and a["args"] == b["args"]:
        return a                          # cheap path, ~95% of the time
    return call_smart("opus-4-7", state)  # arbitration on disagreement
```
One last consideration on routing: budget caps per request, per user, per tenant. The router is the right place to enforce them because the application doesn't always know the cost in advance. LiteLLM supports a max_budget on each virtual key; pair it with per-tenant Redis counters so a runaway loop on one customer cannot drain the shared budget for the rest.
```python
# budget_guard.py — refuse the call if the tenant is at cap.
import time

import redis

r = redis.Redis()

def check_and_charge(tenant_id: str, est_cost_usd: float, daily_cap: float) -> bool:
    key = f"budget:{tenant_id}:{time.strftime('%Y-%m-%d')}"
    spent = float(r.get(key) or 0)
    if spent + est_cost_usd > daily_cap:
        return False
    r.incrbyfloat(key, est_cost_usd)
    r.expire(key, 86_400 * 2)  # keep one day of slack for late reconciliation
    return True
```
Estimate cost with the cheaper-model price before the call (input tokens are knowable; output you cap at max_tokens); reconcile with actual usage after the response arrives. The pre-check is what prevents a single request from putting you 10x over budget — even an estimate that is off by 30% keeps cost incidents bounded.
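The reconciliation step can apply just the delta between estimate and actual, so concurrent charges on the same counter are preserved. A sketch with a plain dict standing in for the Redis counter from budget_guard.py (with Redis, the adjustment is a single incrbyfloat on the same key):

```python
# Post-response reconciliation of an estimated charge against actual usage.
# A dict stands in for the per-tenant Redis counter; the key layout mirrors
# budget_guard.py.

ledger: dict[str, float] = {}

def charge_estimate(key: str, est_cost_usd: float) -> None:
    # pre-call: reserve the capped worst-case estimate
    ledger[key] = ledger.get(key, 0.0) + est_cost_usd

def reconcile(key: str, est_cost_usd: float, actual_cost_usd: float) -> None:
    # post-call: apply only the delta so concurrent charges are preserved
    ledger[key] = ledger.get(key, 0.0) + (actual_cost_usd - est_cost_usd)

charge_estimate("budget:acme:2026-02-01", 0.08)  # estimate at max_tokens
reconcile("budget:acme:2026-02-01", 0.08, 0.05)  # actual came in lower
print(ledger)  # the counter now holds the actual (~$0.05) spend
```

Because the estimate is charged before the call and the delta after, the counter never transiently undercounts, which is the property the cap check relies on.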
One subtler routing pattern worth knowing: cascade with majority vote. Send a request to two cheap models in parallel; if they agree, return immediately; if they disagree, escalate to the smart model as tiebreaker. On classification-like tasks with a small label space this often beats a single mid-tier call on both quality and cost. The math: you pay 2x cheap on every request, plus 1x smart on the disagreement rate. If cheap-model agreement is 85% and the smart model is 25x the price of cheap, your effective cost is 2 + 0.15 * 25 = 5.75x cheap, well under a single mid-tier call at ~10x cheap. Worth modeling on your own data before adopting.
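The cascade arithmetic above, as a function you can run over your own measured agreement rate before adopting the pattern:

```python
# Effective cost of a two-cheap-models + smart-tiebreaker cascade, in units
# of one cheap call. Matches the worked numbers in the text.

def cascade_cost(agreement_rate: float, smart_to_cheap_ratio: float) -> float:
    # always pay for two cheap calls; pay the smart model on disagreement
    return 2.0 + (1.0 - agreement_rate) * smart_to_cheap_ratio

print(round(cascade_cost(0.85, 25), 2))  # 5.75 -- the example from the text
print(round(cascade_cost(0.60, 25), 2))  # 12.0 -- low agreement erodes the win
```

The crossover is easy to read off: the cascade beats a mid-tier model priced at 10x cheap whenever the disagreement rate stays below (10 - 2) / 25 = 32%.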
Also worth modeling: caching as a routing signal. If your system prompt is large and stable, the cache hit rate determines whether escalation is even worth it — escalating to a fresh model means re-paying for cache misses on the smart side. Some routers (LiteLLM included) support cache-aware fallback: prefer the model that already has the cache warm unless quality demands otherwise.
The decision is rarely "which model is best?" — it is "which model satisfies my hardest constraint, and which fallback covers the next-hardest?" Constraints come in this rough priority order in most enterprises: data residency > latency SLA > per-request cost ceiling > quality on hard requests > ecosystem integration. Lock the top constraint first.
A decision tree that fits on one screen, walking the constraints in that priority order:

1. Data cannot leave your network? Self-host Llama 3.3 or Mistral Large 2 on vLLM; stop here.
2. Hard sub-second latency SLA? Haiku, Gemini Flash, or Groq-hosted Llama 8B.
3. Per-request cost ceiling dominates? Gemini Flash or DeepSeek V3, with caching and batch.
4. Hard reasoning, long documents, or deep agent loops? Opus 4.7 or GPT-5, behind validator-gated escalation.
5. Otherwise: Sonnet 4.6 (or hosted Llama 3.3 70B) as the default tier, plus one escalation model and one cross-provider backup.
Three patterns to avoid as starting points: (a) defaulting to GPT-4o because of "ecosystem familiarity" — the model is a generation behind on reasoning at roughly Sonnet's price for lower quality; (b) starting on Opus because "best is best" — you'll burn cash for months before you realize Sonnet covered the same workload at 1/5 the cost; (c) starting on a self-hosted Llama before you have measured demand — GPU sunk cost is hard to walk back, and a managed API gives you 6 months of free elasticity.
Whatever you pick, build the routing seam from day one. Hardcoding a single model name in your application is the most expensive shortcut in LLM engineering — re-pricing or re-availability happens on the provider's schedule, not yours.
One last piece of pragmatism: the same code path should work across at least two providers from launch. The OpenAI-compatible chat completions schema is the de-facto lingua franca; LiteLLM, Together, Fireworks, Groq, vLLM, and Anthropic (via a thin adapter) all speak it. Stay on this shape and provider migration becomes a config change, not a refactor.
```python
# providers.py — one client interface, three providers, one swap.
import os

from openai import OpenAI

ANTHROPIC_KEY = os.environ["ANTHROPIC_API_KEY"]
OPENAI_KEY = os.environ["OPENAI_API_KEY"]
TOGETHER_KEY = os.environ["TOGETHER_API_KEY"]

PROVIDERS = {
    "anthropic": OpenAI(base_url="https://api.anthropic.com/v1/", api_key=ANTHROPIC_KEY),
    "openai": OpenAI(api_key=OPENAI_KEY),
    "together": OpenAI(base_url="https://api.together.xyz/v1", api_key=TOGETHER_KEY),
}

def chat(provider: str, model: str, messages: list[dict]) -> str:
    resp = PROVIDERS[provider].chat.completions.create(
        model=model, messages=messages, max_tokens=1024,
    )
    return resp.choices[0].message.content

# Same call shape, three different backends:
chat("anthropic", "claude-sonnet-4-6", msgs)
chat("openai", "gpt-5", msgs)
chat("together", "meta-llama/Llama-3.3-70B-Instruct-Turbo", msgs)
```
**Sonnet or Opus?** Sonnet for the bulk of RAG and chat traffic. It is roughly 1/5 the price and 2x faster, with quality close enough to Opus on summarization, classification, and short-loop tool use. Reserve Opus for the requests that fail a Sonnet validator — long-document analysis, agent loops beyond ~5 steps, or complex code edits across multiple files.
**First lever for cutting cost?** Prompt caching plus batch. Caching a 4K system prompt drops repeated-input cost by ~85%; the batch API takes another 50% off everything if you can wait up to 24 hours. Together those routinely produce 5–10x reductions on workloads with stable system prompts and no real-time SLA. Only after that should you consider routing to a smaller model.
**When does self-hosting make sense?** Three reasons: data cannot leave your network (regulatory or contractual), unit economics at very high QPS where amortized GPU cost beats per-token pricing, or the need to fine-tune on proprietary data. None of these is "Llama is better" — they are operational constraints that closed APIs cannot satisfy.
**How do you get provider redundancy?** Put LiteLLM (or an equivalent router) in front of two providers — for example Claude Sonnet 4.6 on Anthropic primary, GPT-5 on Azure backup. Configure automatic retry on 429 and 5xx with the second provider. Keep prompts and tool schemas neutral enough to work on both; track per-provider success rate and cost so you can shift the default if one degrades.
**What about bulk summarization of short PDFs?** Gemini 2.x Flash or DeepSeek V3 with the batch API. The job is offline (no latency SLA), per-document cost dominates, and 1-page PDFs do not need frontier reasoning. Either model at batch pricing comes in well under $50 for the run; Flash gets you native PDF/image input out of the box.
Published max context is the input limit, not the recall guarantee. Models degrade differently as context fills — needle-in-a-haystack accuracy at 800K is much lower than at 80K for most models, and hallucination on multi-document synthesis rises with input size. Always benchmark on your own retrieval-augmented documents at the size you actually expect to send, not at the headline number.