The brief: build an evaluation platform for LLM applications that supports offline eval suites (gated against curated datasets), online A/B testing (production traffic split between model variants), and continuous regression detection (every commit on every product team). It must serve 50 product teams, scale to ~10,000 evals per day per team, keep judge-model token spend predictable, and surface results to a dashboard within five minutes of run completion.
Most "we built our own eval platform" projects fail in one of three ways: the schema can't represent the relationship between datasets, runs, and judgments cleanly; the judge cost spirals because nobody put a budget on LLM-as-judge calls; or the regression detector cries wolf so often that teams disable alerts. This design is opinionated about all three.
The platform must support three workflows: offline eval suites run against curated, versioned datasets (gating CI); online A/B testing, where production traffic is split between model variants and judged post hoc; and continuous regression detection on every commit from every product team.
Out of scope: prompt authoring UI (use the team's repo), model training, RAG-pipeline-specific evals beyond providing the hooks. Adversarial/red-team evaluation is a sibling system.
| Metric | Target |
|---|---|
| Run throughput per team | 10,000 cases/day sustained, 50,000/day burst |
| CI smoke-suite latency | p95 < 5 min for 100-case smoke |
| Time to dashboard after run completion | p95 < 5 min |
| Judge cost ceiling | $0.005 per judged case (target); $0.02 hard cap |
| Data retention | Hot 90 days, cold 2 years |
| Availability | 99.5% (eval is async; not customer-facing) |
| Regression alert false-positive rate | < 5% (or teams will mute it) |
The dashboard SLO matters: if engineers wait more than five minutes to see results, they context-switch and the eval loop loses its compounding effect. The judge cost ceiling matters more than people expect — LLM-as-judge using GPT-4-class models at $0.01–0.03 per case can outspend the production system being evaluated.
Aggregate eval volume. 50 teams × 10,000 cases/day = 500,000 evals/day, peak burst 2.5M. At ~1.5s per case (LLM call + judge call, run serially) that's roughly 208 hours of serial work per calendar day — trivially parallelizable.
Judge token spend. Average judge input ~800 tokens (prompt + case + ground truth + LLM output) and 100 tokens output. At GPT-4o pricing ($2.50 / $10.00 per 1M tokens):
per_case_judge_cost: (800 * 2.50 + 100 * 10.00) / 1_000_000 = $0.003
daily_total: 500_000 cases * $0.003 = $1,500/day = $45k/month
peak_burst: 2.5M * $0.003 = $7,500/day at burst
Mitigations baked into capacity: judge-model tiering (Haiku as a cheap pre-filter, GPT-4o only for borderline cases), prompt-caching the static parts of the judge prompt, and sampling (don't judge every online interaction).
Storage. Each result row: ~4 KB (input + output + judgment + metadata). 500k/day × 90 days hot ≈ 180 GB hot in Postgres. Cold tier (Parquet on S3): 4 TB across 2 years.
Compute for the runner. Most LLM-call latency is network plus remote inference, so workers are I/O-bound. 50 concurrent in-flight cases at ~1.5s each ≈ 33 cases/sec sustained, or 2.9M/day — comfortably above the 500k baseline. Bursts are absorbed by horizontal scale on Fargate or a K8s HPA.
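For intuition on that sizing, a minimal asyncio sketch: `process_case_stub` stands in for the real per-case coroutine shown later, and the sleep models the ~1.5s of remote-inference wait that dominates each slot.

```python
import asyncio
import time

CONCURRENCY = 50          # in-flight cases per worker pool
AVG_CASE_SECONDS = 1.5    # LLM call + judge call, mostly network wait

async def process_case_stub(case_id: int) -> None:
    # Stand-in for the real per-case coroutine; the slot is idle most of the time.
    await asyncio.sleep(AVG_CASE_SECONDS)

async def drain(n_cases: int) -> None:
    sem = asyncio.Semaphore(CONCURRENCY)

    async def bounded(i: int) -> None:
        async with sem:
            await process_case_stub(i)

    t0 = time.time()
    await asyncio.gather(*(bounded(i) for i in range(n_cases)))
    elapsed = time.time() - t0
    print(f"{n_cases} cases in {elapsed:.1f}s ≈ {n_cases / elapsed:.0f} cases/s")

# asyncio.run(drain(1000))  # ~33 cases/s with the numbers above
```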
+------------------+ +-----------------+
| Team Repos | | Web Dashboard |
| (GH Actions) | | (Next.js) |
+--------+---------+ +--------+--------+
| trigger | reads
v v
+-----------------------+ +----------------------+
| Eval API (FastAPI) |<->| Auth (OIDC + RBAC) |
+----------+------------+ +----------------------+
|
| enqueue run
v
+-----------------------+ +-------------------------+
| Run Orchestrator |-------->| Job Queue (SQS / Redis)|
+----------+------------+ +-----------+-------------+
| |
| v
| +----------+-----------+
| | Eval Workers |
| | (Fargate/K8s, |
| | horizontal) |
| +----------+-----------+
| |
| +-------------------+-------------------+
| | | |
v v v v
+------------------+ +-----------+ +-----------+ +-------------------+
| Postgres (OLTP) | | LLM under| | Judge | | Cost Tracker |
| datasets, runs, | | test | | Models | | (Redis counters, |
| results, judges | | (team's) | | (Bedrock/| | nightly rollup) |
+--------+---------+ +-----------+ | OpenAI) | +-------------------+
| +-----------+
| CDC
v
+------------------+ +-------------------+
| S3 Parquet Lake |<------>| Trino / Snowflake| (analytics, regression)
+------------------+ +-------------------+
One-liners on each:

- Eval API (FastAPI): authenticated entry point; validates requests, estimates cost against the team budget, creates run rows, enqueues work.
- Auth (OIDC + RBAC): OIDC-backed tokens with per-team roles gate both the API and the dashboard.
- Run Orchestrator: tracks run progress, transitions statuses, computes aggregate metrics on completion.
- Job Queue (SQS / Redis): decouples run creation from execution; visibility timeouts give at-least-once delivery to workers.
- Eval Workers (Fargate/K8s, horizontal): per case, call the LLM under test, run the judges, persist results.
- Cost Tracker: Redis counters incremented per judgment, rolled up nightly against each team's monthly budget.
- Postgres (OLTP): source of truth for teams, datasets, prompts, runs, results, and judgments.
- S3 Parquet lake + Trino/Snowflake: CDC copy of the OLTP data for cross-team analytics, regression detection, and billing rollups.
Postgres schema. Six core tables, one transaction boundary, normalized enough to query cleanly without becoming a join nightmare.
CREATE TABLE teams (
team_id UUID PRIMARY KEY,
slug TEXT UNIQUE NOT NULL,
monthly_budget_usd NUMERIC(10,2) NOT NULL DEFAULT 1000,
created_at TIMESTAMPTZ DEFAULT now()
);
CREATE TABLE datasets (
dataset_id UUID PRIMARY KEY,
team_id UUID NOT NULL REFERENCES teams,
name TEXT NOT NULL,
version INT NOT NULL,
case_count INT NOT NULL,
s3_uri TEXT NOT NULL, -- jsonl in S3
schema_hash BYTEA NOT NULL, -- detect schema drift across versions
created_at TIMESTAMPTZ DEFAULT now(),
UNIQUE (team_id, name, version)
);
CREATE TABLE prompts (
prompt_id UUID PRIMARY KEY,
team_id UUID NOT NULL REFERENCES teams,
name TEXT NOT NULL,
version INT NOT NULL,
body TEXT NOT NULL,
created_at TIMESTAMPTZ DEFAULT now(),
UNIQUE (team_id, name, version)
);
CREATE TABLE runs (
run_id UUID PRIMARY KEY,
team_id UUID NOT NULL REFERENCES teams,
dataset_id UUID NOT NULL REFERENCES datasets,
prompt_id UUID REFERENCES prompts,
llm_model TEXT NOT NULL, -- e.g. "claude-sonnet-4-7"
judge_config JSONB NOT NULL, -- which judges, weights, thresholds
status TEXT NOT NULL, -- queued, running, completed, failed, budget_blocked
triggered_by TEXT, -- "ci:commit_sha" | "manual:user@" | "schedule"
commit_sha TEXT,
started_at TIMESTAMPTZ,
finished_at TIMESTAMPTZ,
case_count INT NOT NULL,
cost_usd NUMERIC(10,4) DEFAULT 0
);
CREATE INDEX runs_team_started ON runs (team_id, started_at DESC);
CREATE INDEX runs_dataset ON runs (dataset_id, started_at DESC);
CREATE TABLE results (
result_id UUID PRIMARY KEY,
run_id UUID NOT NULL REFERENCES runs ON DELETE CASCADE,
case_id TEXT NOT NULL, -- stable ID from the dataset
input JSONB NOT NULL,
ground_truth JSONB,
model_output TEXT NOT NULL,
latency_ms INT,
tokens_in INT,
tokens_out INT,
cost_usd NUMERIC(10,6),
status TEXT NOT NULL, -- ok | model_error | timeout
created_at TIMESTAMPTZ DEFAULT now(),
UNIQUE (run_id, case_id) -- idempotency key for at-least-once queue delivery
);
CREATE INDEX results_run ON results (run_id);
CREATE TABLE judgments (
judgment_id UUID PRIMARY KEY,
result_id UUID NOT NULL REFERENCES results ON DELETE CASCADE,
judge_name TEXT NOT NULL, -- e.g. "factual_accuracy", "hallucination_v2"
judge_model TEXT NOT NULL, -- "gpt-4o" or "exact_match" for non-LLM judges
score NUMERIC(6,4), -- 0..1 normalized
rationale TEXT, -- judge's explanation, optional
cost_usd NUMERIC(10,6),
created_at TIMESTAMPTZ DEFAULT now()
);
CREATE INDEX judgments_result ON judgments (result_id);
CREATE INDEX judgments_judge_score ON judgments (judge_name, score);
Why JSONB for input / ground_truth. Cases are heterogeneous across datasets — multi-turn conversations, RAG queries with retrieved chunks, agent traces. JSONB keeps the schema flat without a column explosion; specific dashboards extract the fields they need with jsonb_path_query.
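As an illustration (the JSON paths here are hypothetical; each dataset declares its own fields), a dashboard query can pull nested fields straight out of the JSONB with the `jsonb_path_query` family — a psycopg sketch:

```python
import psycopg

# Hypothetical dashboard extraction: pull two nested fields out of the JSONB
# input / ground_truth columns for one run. The paths '$.question' and '$.answer'
# are illustrative, not part of the platform schema.
QUERY = """
SELECT case_id,
       jsonb_path_query_first(input, '$.question')      #>> '{}' AS question,
       jsonb_path_query_first(ground_truth, '$.answer') #>> '{}' AS expected
FROM results
WHERE run_id = %s
"""

def fetch_case_fields(conn: psycopg.Connection, run_id: str):
    with conn.cursor() as cur:
        cur.execute(QUERY, (run_id,))
        return cur.fetchall()
```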
Why a separate judgments table. A single result can have multiple judgments (factuality + harmlessness + format-compliance) and the set of judges evolves. One-row-per-result with N judge columns becomes painful within a quarter.
CDC to lake. Debezium on Postgres → Kafka → Iceberg tables in S3. Trino reads cross-team aggregates; Snowflake or DuckDB also work. The lake powers regression detection (compare today's mean for each metric — a judge score, nDCG, etc. — against its trailing 30-day mean, per dataset) and billing rollups.
A team opens a pull request. CI calls the eval API:
# .github/workflows/eval.yml
name: eval-on-pr
on: [pull_request]
jobs:
smoke:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- name: Trigger eval
env:
EVAL_TOKEN: ${{ secrets.EVAL_PLATFORM_TOKEN }}
run: |
RUN_ID=$(curl -s -X POST https://eval.internal/api/runs \
-H "Authorization: Bearer $EVAL_TOKEN" \
-H "Content-Type: application/json" \
-d '{
"dataset": "rag-smoke@v3",
"prompt": "answer-prompt@v12",
"llm_model": "claude-sonnet-4-7",
"judge_config": {"judges": ["factuality", "format"], "judge_model": "gpt-4o-mini"},
"commit_sha": "${{ github.sha }}",
"triggered_by": "ci:${{ github.sha }}"
}' | jq -r .run_id)
echo "Run: https://eval.internal/runs/$RUN_ID"
# Block PR until done; polls every 10s, fails if regression detected.
./scripts/wait-for-eval.sh "$RUN_ID"
Inside the API, on POST /api/runs: authenticate the bearer token and resolve the caller's team, resolve the dataset and prompt versions, estimate the run's cost from the case count and judge config, and check it against the team's remaining monthly budget (reject the run as budget_blocked if it would breach). Then insert the run row with status='queued', enqueue it, and return run_id.
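A sketch of that handler as a FastAPI route; `current_team`, `resolve_dataset`, `estimate_cost`, `remaining_budget`, `insert_run`, and `queue` are assumed helpers not shown elsewhere in this doc.

```python
from uuid import uuid4
from fastapi import APIRouter, Depends, HTTPException
from pydantic import BaseModel

router = APIRouter()

class RunRequest(BaseModel):
    dataset: str                  # "name@version", e.g. "rag-smoke@v3"
    prompt: str | None = None     # "answer-prompt@v12"
    llm_model: str
    judge_config: dict
    commit_sha: str | None = None
    triggered_by: str | None = None

@router.post("/api/runs")
async def create_run(req: RunRequest, team=Depends(current_team)):
    dataset = await resolve_dataset(team.team_id, req.dataset)      # name@version -> datasets row
    estimate = estimate_cost(dataset.case_count, req.judge_config)  # upfront, from case count
    if estimate > await remaining_budget(team.team_id):
        # Reject before any spend; CI surfaces the budget error immediately.
        raise HTTPException(status_code=402, detail="monthly eval budget exceeded")
    run_id = uuid4()
    await insert_run(run_id, team.team_id, dataset.dataset_id, req, status="queued")
    await queue.enqueue(str(run_id))                                # SQS / Redis job queue
    return {"run_id": str(run_id), "estimated_cost_usd": estimate}
```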
The worker, per case:
import time

# llm_client, render, JUDGES, record_result, record_judgment, cost_tracker and runs
# are module-level clients/helpers; Run and Case are the run row / dataset case types.
async def process_case(run: Run, case: Case):
# 1. Call the team's LLM endpoint with retries
t0 = time.time()
try:
out = await llm_client.invoke(
model=run.llm_model,
prompt=render(run.prompt, case.input),
timeout=60,
)
except (TimeoutError, ProviderError) as e:
await record_result(run, case, status="model_error", error=str(e))
return
# 2. Persist the model result row
result_id = await record_result(
run, case,
model_output=out.text,
latency_ms=int((time.time()-t0)*1000),
tokens_in=out.usage.input,
tokens_out=out.usage.output,
cost_usd=out.cost,
status="ok",
)
# 3. Run judges; some are deterministic (exact_match), some are LLM-as-judge
for judge in run.judge_config["judges"]:
score, rationale, cost = await JUDGES[judge].score(
case=case, model_output=out.text,
judge_model=run.judge_config["judge_model"],
)
await record_judgment(result_id, judge, score, rationale, cost)
await cost_tracker.add(run.team_id, cost)
# 4. Bump per-run progress
await runs.increment_progress(run.run_id)
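One detail worth making explicit: record_result should be an idempotent upsert so at-least-once queue delivery can't double-write a case. A simplified sketch with asyncpg, assuming the UNIQUE (run_id, case_id) constraint from the schema above and Postgres 13+ for gen_random_uuid():

```python
import json
import asyncpg

# If SQS redelivers a message and a second worker re-processes the case,
# ON CONFLICT turns the duplicate write into a no-op.
INSERT_RESULT = """
INSERT INTO results (result_id, run_id, case_id, input, model_output, status)
VALUES (gen_random_uuid(), $1, $2, $3::jsonb, $4, $5)
ON CONFLICT (run_id, case_id) DO NOTHING
RETURNING result_id
"""

async def record_result(pool: asyncpg.Pool, run_id, case_id, case_input: dict,
                        model_output: str, status: str):
    async with pool.acquire() as conn:
        row = await conn.fetchrow(INSERT_RESULT, run_id, case_id,
                                  json.dumps(case_input), model_output, status)
        # None means another worker already wrote this case; skip re-judging it.
        return row["result_id"] if row else None
```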
When the orchestrator sees progress == case_count, it transitions the run to completed, computes aggregate metrics (mean per judge, p95 latency, cost), and writes them to a run_metrics view. Dashboards listen on a Postgres LISTEN/NOTIFY channel for sub-second refresh.
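A sketch of that transition with asyncpg; the completion check here counts result rows directly rather than a separate progress counter, and the run_metrics aggregation is omitted.

```python
import asyncpg

async def maybe_complete_run(conn: asyncpg.Connection, run_id) -> None:
    # Flip the run to completed once every case has a result row, then notify dashboards.
    done = await conn.fetchval(
        """
        UPDATE runs
        SET status = 'completed', finished_at = now()
        WHERE run_id = $1
          AND status = 'running'
          AND case_count = (SELECT count(*) FROM results WHERE run_id = $1)
        RETURNING run_id
        """,
        run_id,
    )
    if done:
        await conn.execute("SELECT pg_notify('run_completed', $1)", str(run_id))

async def watch_completions(pool: asyncpg.Pool) -> None:
    # Dashboard backend holds a dedicated connection; the callback would push
    # the run_id to connected clients instead of printing.
    conn = await pool.acquire()
    await conn.add_listener(
        "run_completed",
        lambda _conn, _pid, _channel, payload: print("refresh run", payload),
    )
```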
Regression detection runs on the lake side as a scheduled Trino query:
-- Per (team, dataset, judge), compare today's mean to trailing 30-day mean & stdev.
WITH baseline AS (
SELECT team_id, dataset_id, judge_name,
AVG(score) AS mu, STDDEV(score) AS sigma
FROM v_judgments
WHERE created_at >= current_date - INTERVAL '30' DAY
AND created_at < current_date - INTERVAL '1' DAY
GROUP BY team_id, dataset_id, judge_name
),
today AS (
SELECT team_id, dataset_id, judge_name, AVG(score) AS today_mean, COUNT(*) AS n
FROM v_judgments
WHERE created_at >= current_date
GROUP BY team_id, dataset_id, judge_name
)
SELECT t.*, b.mu, b.sigma,
(b.mu - t.today_mean) / NULLIF(b.sigma, 0) AS z_drop
FROM today t JOIN baseline b USING (team_id, dataset_id, judge_name)
WHERE t.n >= 100 -- enough samples for the test to mean anything
AND (b.mu - t.today_mean) / NULLIF(b.sigma, 0) > 2.5 -- 2.5-sigma drop
ORDER BY z_drop DESC;
Anything above 2.5σ with N ≥ 100 fires a Slack alert to the team's eval channel with deeplinks to the offending run.
Failure modes and hot spots:

- Write hot spot: the judgments table. 2M inserts/day at burst, mostly JSONB. Use COPY-style bulk inserts in batches of 500 from each worker; partition judgments by month with attach/detach for rolling retention.
- Judge provider outage. Record judgments.score = NULL with status='judge_unavailable'; the run completes but is flagged as partially judged. An operator can re-judge later from the stored (case, model_output) without re-running the LLM under test.
- Dataset schema drift. Validate schema_hash on dataset upload, plus mid-run validation that each case parses against the declared schema; the first 10 schema failures abort the run with a clear error rather than silently skipping cases.
- Judge model churn. Pin judge_model with a version (gpt-4o-2024-11-20, not "gpt-4o"). When the team wants to upgrade, run the new judge on a calibration set side-by-side with the old, surface the per-judge correlation, and only then swap.
- Worker crash mid-run. A run stuck in running: the worker dies mid-batch, the SQS message visibility expires, and another worker re-processes the message. The idempotency key (run_id, case_id) in the results table prevents double-writes.
- Budget exhausted mid-run. The run transitions to status='budget_blocked'; CI fails fast with a link to the budget page; a team admin can request a top-up that goes through approval before resuming.

Per 1,000 cases evaluated (one LLM call + one judge call):
| Component | Cost / 1k cases |
|---|---|
| LLM under test (Claude Sonnet, ~600 in / 200 out) | $3.60 |
| Judge (GPT-4o-mini, 800 in / 100 out) | $0.18 |
| Judge (GPT-4o, 800 in / 100 out) | $3.00 |
| Worker compute (Fargate, ~1.5s @ 0.5 vCPU) | $0.02 |
| Postgres write + storage (90d hot) | $0.05 |
| S3 lake storage (2y cold) | $0.01 |
| Total / 1k (Sonnet + 4o-mini judge) | ~$3.86 |
| Total / 1k (Sonnet + GPT-4o judge) | ~$6.68 |
The judge model choice swings total cost by ~70%. Tier the judges: cheap deterministic checks (regex, JSON-parse, exact-match) for free; mini-class LLM judge for the 80% obvious cases; GPT-4o only for borderline scores or as the periodic calibration baseline. This brings effective per-case cost back near the LLM-under-test floor.
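A sketch of that tiering; the borderline band and the `mini_judge` / `full_judge` helpers are illustrative stand-ins for the real judge registry, not measured thresholds.

```python
import json

# Tier 0: free deterministic checks. Tier 1: mini-class LLM judge for clear-cut
# cases. Tier 2: the expensive judge only when the mini judge is uncertain.
BORDERLINE = (0.25, 0.75)   # illustrative band

async def tiered_judge(case, model_output: str) -> tuple[float, str, float]:
    # Tier 0: deterministic checks cost nothing and catch hard failures.
    try:
        json.loads(model_output)            # e.g. format compliance for JSON outputs
    except ValueError:
        return 0.0, "output is not valid JSON", 0.0

    # Tier 1: cheap LLM judge handles the ~80% of cases that are clearly good or bad.
    score, rationale, cost = await mini_judge(case, model_output)   # e.g. gpt-4o-mini
    if not (BORDERLINE[0] < score < BORDERLINE[1]):
        return score, rationale, cost

    # Tier 2: escalate only borderline scores to the expensive judge.
    score2, rationale2, cost2 = await full_judge(case, model_output)  # e.g. gpt-4o
    return score2, rationale2, cost + cost2
```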
| Option | Wins on | Loses on |
|---|---|---|
| This DIY platform | Custom judges, integration with internal CI, no per-trace pricing, full data ownership. | Build cost; maintenance; needs a dedicated team. |
| LangSmith | Polished UI, built-in evaluators, tight LangChain integration. | Per-trace pricing scales painfully at 50 teams; vendor data residency; coupling to LangChain idioms. |
| Arize Phoenix | Open source, OpenTelemetry-native, runs on your own infra. | Less opinionated — you still build the regression detector and team-budget governance. |
| Promptfoo | YAML-driven, runs locally and in CI, no infra to host. | No central dashboards across teams; no historical store; no online A/B. |
| Braintrust | SaaS, strong dataset versioning + diff UI, fast onboarding. | Pricing model; data leaves the building; less customization. |
| OpenAI Evals | Free, transparent, large eval registry. | Single-tenant tooling; no online A/B; very limited governance. |
Decision rule. Below ~5 teams or < 10k evals/day total, buy (LangSmith, Braintrust). Above that, the per-trace pricing crosses build cost within a year, and the customization need (internal CI, per-team budgets, custom judges) starts to outweigh polished UX. Phoenix is a reasonable middle ground: open-source core, build the orchestration and governance on top.
Controlling judge spend: three layers. (1) Per-team monthly budget enforced at run-create time; estimate cost up front from case count and reject runs that would breach. (2) Mid-run circuit breaker: every 1k cases, compare actual cost to the estimate; pause if it exceeds 1.5x. (3) Tier the judges — deterministic checks (regex, JSON-parse, exact-match) for free, a mini-class LLM (gpt-4o-mini, Haiku) for the bulk of cases, full GPT-4o only for borderline scores or periodic calibration. Layer (3) typically cuts judge spend 5–10x with negligible quality loss.
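A sketch of layer (2), assuming the Redis counters from the Cost Tracker box and hypothetical key names (`run_cost:<id>`, `run_paused:<id>`):

```python
import redis.asyncio as redis

PAUSE_FACTOR = 1.5          # pause when actual spend exceeds 1.5x the upfront estimate
CHECK_EVERY = 1_000         # cases between checkpoints

async def maybe_pause_run(r: redis.Redis, run_id: str, cases_done: int,
                          estimated_usd: float) -> bool:
    # Called by the worker every CHECK_EVERY cases; actual spend lives in a Redis
    # counter that cost_tracker.add() increments per judgment.
    if cases_done % CHECK_EVERY != 0:
        return False
    actual = float(await r.get(f"run_cost:{run_id}") or 0.0)
    if actual > PAUSE_FACTOR * estimated_usd:
        await r.set(f"run_paused:{run_id}", 1)   # workers check this flag before each case
        return True
    return False
```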
Why both Postgres and a lake? Different access patterns, different optimizers. Postgres handles the write-heavy hot path — transactional inserts of results and judgments, indexed lookups for "show me this run" dashboard pages — with predictable sub-second latency. Cross-team trend queries ("90-day metric over all teams, all datasets") are columnar workloads that Postgres does poorly above 100 GB; pushing them to Parquet on S3 with Trino keeps OLTP fast and isolates analytical load. CDC keeps the lake within seconds of OLTP without a separate ETL pipeline.
Keeping regression alerts trustworthy: three knobs. Set a minimum sample size (N ≥ 100 today vs the trailing 30 days) so single noisy runs don't fire. Use a z-score against the trailing window's stdev (2.5σ threshold), not absolute deltas — some metrics are inherently noisy. Keep a per-team mute list with required justification and expiry; muted alerts surface in a weekly digest so they don't get forgotten. The combination keeps the false-positive rate near the 5% target, which is what teams will actually act on.
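The same knobs as a small filter applied to each row the regression query returns; the mute-list lookup is assumed to come from a per-team table with an expiry.

```python
from datetime import datetime

MIN_N = 100          # minimum judged cases today
Z_THRESHOLD = 2.5    # sigma drop required to alert

def should_alert(n: int, mu: float, sigma: float, today_mean: float,
                 muted_until: datetime | None, now: datetime) -> bool:
    if n < MIN_N:
        return False                      # single noisy run: not enough samples
    if muted_until and now < muted_until:
        return False                      # muted with justification; lands in the weekly digest
    if sigma == 0:
        return False                      # flat metric; a z-score is meaningless
    return (mu - today_mean) / sigma > Z_THRESHOLD
```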
Handling judge-model upgrades: pin the judge model with a date (gpt-4o-2024-11-20, never the bare "gpt-4o" alias). For an upgrade, run both the old and new judge on a fixed calibration set (~500 representative cases per judge family); compute correlation, mean shift, and per-bucket disagreement. If correlation is > 0.9 and the mean shift is small, swap. If not, either keep the old judge or treat the new one as a separate metric and migrate teams individually. Never silently swap — baselines drift, alerts fire, trust evaporates.
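A sketch of the swap check, assuming both judges have already scored the same calibration cases; the correlation floor matches the 0.9 quoted above, while the mean-shift tolerance is an illustrative value to pick per judge family.

```python
import numpy as np
from scipy.stats import spearmanr

CORR_FLOOR = 0.9
MEAN_SHIFT_TOL = 0.02    # illustrative; tune per judge family

def judge_swap_ok(old_scores: list[float], new_scores: list[float]) -> bool:
    old, new = np.asarray(old_scores), np.asarray(new_scores)
    corr, _ = spearmanr(old, new)        # rank correlation is robust to calibration shifts
    mean_shift = abs(new.mean() - old.mean())
    print(f"corr={corr:.3f} mean_shift={mean_shift:.3f}")
    return corr > CORR_FLOOR and mean_shift < MEAN_SHIFT_TOL
```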
Online A/B testing: two pieces. First, traffic splitting happens in the team's serving layer (their feature-flag service, e.g., LaunchDarkly or Statsig), not the eval platform — we don't sit in the request path. Teams tag each interaction with variant_id in the log payload they ship to us. Second, post-hoc judging: a sampler picks K% of logged interactions per variant per day, drops them through the same judge pipeline as offline runs, and writes to results + judgments with triggered_by='online'. Dashboards group by variant_id and show metric deltas with significance from a Mann-Whitney U test (judge scores are not normally distributed). This keeps the eval platform out of the critical serving path while reusing all the judge and dashboard machinery.
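A sketch of the per-variant comparison on the sampled online judgments, assuming the scores have already been pulled from the judgments table grouped by variant_id:

```python
from scipy.stats import mannwhitneyu

def compare_variants(scores_a: list[float], scores_b: list[float],
                     alpha: float = 0.05) -> dict:
    # Non-parametric test on per-case judge scores for variant A vs variant B.
    stat, p = mannwhitneyu(scores_a, scores_b, alternative="two-sided")
    return {
        "delta_mean": sum(scores_b) / len(scores_b) - sum(scores_a) / len(scores_a),
        "p_value": p,
        "significant": p < alpha,
    }
```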
Reproducibility: pin everything. Dataset version + content hash, prompt version + content hash, LLM model with a date-stamped revision, judge model with a date-stamped revision, and the platform's own git SHA, all logged on the run row. Store the raw model output and judge rationale in results / judgments so you can re-judge later without re-invoking the LLM. The S3 lake is append-only; old judgments are never overwritten, so a run from six months ago is queryable in full. The one thing that isn't perfectly reproducible is provider-side determinism — setting temperature=0 and seed=N helps, but providers don't all honor it, so for "is this regression real?" you re-run the eval rather than trusting old numbers verbatim.