The brief: build a SaaS platform where each customer can bring their own data, choose their model provider (Bedrock, Azure OpenAI, or self-hosted vLLM), and receive isolated cost and usage reporting. Customer A is a US healthcare provider that requires HIPAA-eligible Bedrock in us-east-1. Customer B is a German bank that requires Azure OpenAI in EU regions with customer-managed keys. Customer C is a defense contractor that requires fully on-prem inference. Same product, three deployment realities.
Multi-tenancy looks like a feature checkbox until you have to debug a noisy-neighbor issue at 2 AM, or explain to a customer why their bill jumped 4×, or convince a security auditor that tenant A's keys cannot decrypt tenant B's data even with root on the database host. This design is opinionated about all three.
| Metric | Target |
|---|---|
| Routing overhead added to LLM call | < 25 ms p95 |
| Quota check latency | < 5 ms p95 |
| Tenant onboarding time | < 30 min, fully automated |
| Cost report freshness | < 5 min from request |
| Cross-tenant data leak rate | Zero, enforced at multiple layers |
| Availability (data plane) | 99.95% per region |
| Availability (control plane) | 99.9% (it can be down briefly without taking inference down) |
Tenants and traffic. Plan for 500 paying tenants, average 10 RPS each at peak, top-1% tenants at 200 RPS. Aggregate ~5,000 sustained RPS, ~15,000 peak. Average request: 2,000 input tokens + 500 output tokens.
Token volume. 5,000 RPS × 2,500 tokens/req ≈ 12.5M tokens/sec aggregate ≈ 1 trillion tokens/day. Of that, ≈ 70% routed to managed providers (Bedrock/Azure), ≈ 30% to self-hosted vLLM.
Per-tenant data. Average 100k documents at 4 KB ≈ 400 MB text per tenant, 4 GB after embeddings. 500 tenants ≈ 2 TB total at steady state; allow 10 TB headroom for data-heavy tenants.
Self-hosted GPU pool. 30% of 12.5M tokens/sec = 3.75M tokens/sec on vLLM. At Llama 3.1 70B FP8 throughput of ~3,000 tokens/sec/H100 batched, that needs ~1,250 H100s. That is implausible budget-wise; the design assumption is that self-hosted is an opt-in tier reserved for ~5% of traffic. Right-size for ~200 H100s aggregate, or roughly $3.5M/year of GPU lease at ~$2/hr per H100.
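The envelope math above fits in a few lines; every constant here is an assumption stated in this section, not a measurement:

```python
# Back-of-envelope capacity math. All constants are the assumptions
# stated above (request mix, H100 throughput, lease rate).
RPS_SUSTAINED = 5_000
TOKENS_PER_REQ = 2_000 + 500            # input + output tokens
H100_TOKENS_PER_SEC = 3_000             # Llama 70B FP8, batched (assumed)
H100_LEASE_USD_PER_HR = 2.0

agg_tokens_per_sec = RPS_SUSTAINED * TOKENS_PER_REQ          # 12.5M
tokens_per_day = agg_tokens_per_sec * 86_400                 # ~1.08e12

self_hosted_share = 0.05                                     # opt-in tier
gpus = agg_tokens_per_sec * self_hosted_share / H100_TOKENS_PER_SEC
annual_lease_usd = gpus * H100_LEASE_USD_PER_HR * 24 * 365   # ~$3.65M
```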
The single most important design choice in a multi-tenant platform is the split between control plane (slow, transactional, can briefly be down) and data plane (fast, hot path, must never be down).
+======================== CONTROL PLANE ========================+
| + tenant onboarding API + KMS key provisioning |
| + quota / billing config + provider credential storage |
| + audit / compliance UI + cost reporting aggregation |
| Backed by: Postgres (tenants), Vault (secrets), S3 (logs) |
| SLO: 99.9% (down for 5 min = no new tenants, inference fine) |
+================================================================+
|
policy + creds pushed via
versioned snapshots (S3 + ETag)
v
+========================= DATA PLANE ==========================+
| + API gateway + auth + rate limit |
| + provider router (Bedrock / Azure / vLLM) |
| + per-tenant vector index access |
| + token-counting + per-request cost emission |
| Backed by: per-region Redis, Postgres replicas, Kafka |
| SLO: 99.95% (every 1 min down = real customer impact) |
+================================================================+
The data plane caches everything it needs from the control plane: tenant config, quota state, provider credentials, model routing rules. If the control plane is down, inference continues using the last cached snapshot. New tenant onboarding pauses; existing tenants are untouched.
Provider credentials are pushed via short-lived signed snapshots from the control plane to each data-plane region; the data plane never queries Vault on the hot path.
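The snapshot refresh on the data-plane side amounts to a conditional fetch plus an atomic swap. A minimal sketch, with the fetch injected so the logic is testable (in the real system this would be an S3 `GET` with `If-None-Match` returning 304 when the ETag is unchanged; the class and names are illustrative, not the platform's actual API):

```python
from typing import Callable, Optional, Tuple

class SnapshotCache:
    """Holds the last-known control-plane snapshot for this region."""

    def __init__(self, fetch: Callable[[Optional[str]], Tuple[Optional[dict], str]]):
        # fetch(etag) returns (snapshot, new_etag); snapshot is None
        # when the ETag is unchanged (the HTTP 304 case).
        self._fetch = fetch
        self._etag: Optional[str] = None
        self._config: dict = {}

    def refresh(self) -> bool:
        """Poll the control plane; returns True if a new snapshot landed."""
        snapshot, etag = self._fetch(self._etag)
        if snapshot is None:
            return False            # not modified: keep serving the cache
        self._config, self._etag = snapshot, etag
        return True

    def get(self, tenant_id: str) -> Optional[dict]:
        # Hot path: in-memory lookup only, never a network call. If the
        # control plane is down, this keeps serving the last snapshot.
        return self._config.get(tenant_id)
```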
Three isolation models exist; pick deliberately based on the tenant's risk profile, not as a one-size-fits-all default.
| Model | Isolation | Cost | When to use |
|---|---|---|---|
| Shared schema + RLS | App-layer + DB row-level security | Cheapest | SMB tier, no regulatory ask |
| Schema-per-tenant | DB-level, separate ACLs | Moderate | Enterprise; auditable boundary |
| Database-per-tenant | Physical, separate KMS key | Highest | Regulated (HIPAA, ITAR, EU sovereignty) |
The default is schema-per-tenant; promote to database-per-tenant on contractual requirement. Below is the schema-per-tenant skeleton:
-- Control-plane catalog
CREATE TABLE tenants (
tenant_id UUID PRIMARY KEY,
name TEXT NOT NULL,
region TEXT NOT NULL, -- us-east-1, eu-central-1, on-prem-dc1
isolation_model TEXT NOT NULL, -- 'rls', 'schema', 'database'
schema_name TEXT, -- e.g. 'tenant_a1b2c3'
kms_key_arn TEXT NOT NULL, -- customer-managed key
provider_config JSONB NOT NULL, -- {bedrock: {...}, azure: {...}, vllm: {...}}
quota_rpm INT NOT NULL,
quota_tpm BIGINT NOT NULL,
monthly_budget_usd NUMERIC NOT NULL,
created_at TIMESTAMPTZ DEFAULT now()
);
-- Per-tenant schema (created on onboarding, identical structure)
CREATE SCHEMA tenant_a1b2c3;
SET search_path TO tenant_a1b2c3;
CREATE TABLE documents (...);
CREATE TABLE chunks (...);
CREATE TABLE api_keys (...);
CREATE TABLE usage (request_id UUID, ts TIMESTAMPTZ, model TEXT,
input_tokens INT, output_tokens INT, cost_usd NUMERIC);
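One detail the skeleton above glosses over: schema names cannot be bound as SQL parameters, so the onboarding code has to interpolate them into DDL. A safe sketch derives the name from the tenant UUID's hex (never from customer-supplied text) and validates it before interpolation; the function names here are illustrative, not the platform's actual onboarding code:

```python
import re
import uuid

# Only names we generated ourselves can match this.
SCHEMA_RE = re.compile(r"^tenant_[0-9a-f]{12}$")

def schema_name(tenant_id: uuid.UUID) -> str:
    # Derived from the UUID, never from a customer-supplied string.
    return f"tenant_{tenant_id.hex[:12]}"

def onboarding_ddl(tenant_id: uuid.UUID) -> str:
    name = schema_name(tenant_id)
    if not SCHEMA_RE.match(name):          # defense in depth before DDL
        raise ValueError(f"unsafe schema name: {name}")
    return "\n".join([
        f"CREATE SCHEMA {name};",
        f"SET search_path TO {name};",
        # per-tenant tables elided, as in the skeleton above
    ])
```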
Encryption. Every per-tenant Postgres schema has its tablespace encrypted with the tenant's customer-managed KMS key (CMK) via envelope encryption. The platform holds only the wrapped data encryption key (DEK); revoking the CMK in the customer's AWS account makes the entire schema unreadable, instantly. This is the design feature compliance teams care most about and is hard to bolt on later.
Vector indexes. A per-tenant chunks table inside the tenant schema means each tenant's HNSW index is physically separate. No cross-tenant query is possible by construction; there is no RLS predicate to forget.
import time

from fastapi import HTTPException

async def chat_completion(req: ChatRequest, api_key: str) -> ChatResponse:
    # 1. Auth + tenant lookup (cached, ~1 ms)
    tenant = await tenant_cache.lookup_by_key(api_key)
    if tenant is None:
        raise HTTPException(401)

    # 2. Quota check (Redis token bucket, ~2 ms)
    if not await quota.try_consume(
        tenant.id, rpm=1, tpm=estimate_tokens(req)
    ):
        raise HTTPException(429, headers=quota.retry_after(tenant.id))

    # 3. Region pin: refuse if the request hit the wrong region
    if tenant.region != CURRENT_REGION:
        raise HTTPException(421, detail=f"redirect to {tenant.region}")

    # 4. Provider routing (~1 ms, in-memory)
    provider = router.pick(tenant, req.model)
    # provider is one of: BedrockProvider, AzureProvider, VLLMProvider

    # 5. Decrypt provider creds with tenant CMK (cached short-lived, ~3 ms first call)
    creds = await creds_cache.get(tenant.id, provider.name)

    # 6. Forward the call (the 99% of latency)
    start = time.monotonic_ns()
    resp = await provider.complete(req, creds)
    latency_ns = time.monotonic_ns() - start

    # 7. Emit usage event to Kafka (fire-and-forget, non-blocking)
    usage_emitter.emit(UsageEvent(
        tenant_id=tenant.id,
        api_key=api_key,
        model=resp.model,
        input_tokens=resp.usage.input_tokens,
        output_tokens=resp.usage.output_tokens,
        cost_usd=price.compute(provider, resp.usage),
        latency_ns=latency_ns,
    ))
    return resp
The routing rules live in tenant.provider_config as JSONB:
provider_config:
default_model: claude-3-5-sonnet
routes:
- model_pattern: "claude-*"
provider: bedrock
region: us-east-1
role_arn: arn:aws:iam::123456789012:role/tenant-a-bedrock
- model_pattern: "gpt-4*"
provider: azure
endpoint: https://tenant-a.openai.azure.com
deployment: gpt-4o
- model_pattern: "llama-3.3-70b"
provider: vllm
endpoint: https://vllm-pool-1.internal:8000
fallback:
when: provider_5xx
use: bedrock/claude-3-5-haiku
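Resolving a model name against those routes is a first-match glob scan. A minimal sketch of what `router.pick` could look like, using `fnmatch` for the `model_pattern` globs (the caller applies the `fallback` entry on provider 5xx; function names are illustrative):

```python
from fnmatch import fnmatch

def pick_route(provider_config: dict, model: str) -> dict:
    """Return the first route whose model_pattern glob matches `model`."""
    for route in provider_config["routes"]:
        if fnmatch(model, route["model_pattern"]):
            return route
    raise LookupError(f"no route for model {model!r}")
```

First-match semantics keep the config predictable: a tenant who wants a specific model pinned to a specific provider lists that route above the broader glob.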
Idempotency. Every chat request requires an idempotency key header. The router uses it as a Redis key with 24 h TTL pointing at the prior response; replays return the cached response without re-billing the customer or re-calling the provider.
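The check needs to be race-safe: two concurrent requests with the same idempotency key must not both reach the provider. A sketch of the pattern, assuming a SET-NX reservation plus a stored response under the same key (`redis` here is any client exposing async `set`/`get`; the names and sentinel are illustrative, not the platform's actual code):

```python
IDEMPOTENCY_TTL_S = 24 * 3600
PENDING = b"__pending__"        # sentinel: first attempt still in flight

async def idempotent_call(redis, idem_key: str, call):
    # SET NX atomically reserves the key; only one caller wins the race.
    reserved = await redis.set(idem_key, PENDING, nx=True, ex=IDEMPOTENCY_TTL_S)
    if not reserved:
        cached = await redis.get(idem_key)
        if cached is not None and cached != PENDING:
            # Replay: return the prior response, no re-bill, no provider call.
            return cached
        # First attempt still running (or the key just expired): retryable.
        raise RuntimeError("request in flight; retry later")
    resp = await call()                          # the actual provider call
    await redis.set(idem_key, resp, ex=IDEMPOTENCY_TTL_S)
    return resp
```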
Per 1M tokens routed (averaged across providers):
provider_pass_through: $3.00 # weighted avg of Bedrock/Azure/vLLM
platform_overhead:
api_gateway: $0.10
routing_compute: $0.05
quota_redis: $0.02
usage_pipeline: $0.05
storage_postgres: $0.08
audit_logs_s3: $0.03
---
cost_to_us: $3.33
list_price_to_customer: $4.50 # ~35% markup on cost (~26% gross margin)
Self-hosted vLLM tier is fixed-cost per H100 (~$2/hr lease) regardless of volume; the marginal token cost is essentially zero, so the platform charges a flat per-PTU monthly fee plus a small overage rate.
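The unit economics above check out arithmetically: the overhead items sum to $0.33 per 1M tokens, and $4.50 over a $3.33 cost is a ~35% markup on cost (equivalently a ~26% gross margin):

```python
# Reproduce the per-1M-token cost table above.
provider_pass_through = 3.00
overhead_items = [0.10, 0.05, 0.02, 0.05, 0.08, 0.03]   # gateway..audit

cost_to_us = provider_pass_through + sum(overhead_items)  # ~$3.33
list_price = 4.50
markup_on_cost = (list_price - cost_to_us) / cost_to_us   # ~0.35
gross_margin = (list_price - cost_to_us) / list_price     # ~0.26
```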
Cost attribution flow:
request --> data plane --> emit UsageEvent (Kafka)
|
+-------------------+-------------------+
| |
v v
Redis incr by tenant_id ClickHouse insert (sharded)
(live dashboard, 1s lag) (billing source of truth)
| |
v v
/usage/live API endpoint nightly invoice generator
reconciliation vs Stripe
Two pipelines, two consistency guarantees: live dashboards prioritize speed, billing prioritizes accuracy. They will diverge by < 1% in normal operation; billing always wins in disputes.
Schema-per-tenant vs RLS vs database-per-tenant. RLS is fine for SMB and removes operational complexity, but a single missed SET app.tenant_id in a query path becomes a cross-tenant leak. Schema-per-tenant moves isolation to the database boundary: a wrong search_path returns "relation does not exist" rather than leaking data, which is a much easier failure mode to detect. Database-per-tenant is required for HIPAA/ITAR/EU-sovereignty tenants and gives the strongest CMK story but multiplies operational cost. We use all three; the tier is set per tenant, not per platform.
Per-tenant model routing in the gateway vs in the application. Doing it in the gateway means every tenant gets the same routing logic and cost emission for free, and applications stay simple. Doing it in the application gives finer control (e.g. retry to a cheaper model on long inputs) but duplicates logic across teams. The platform should provide both: a default gateway route and an opt-in "I'll route myself" mode for sophisticated customers.
Centralized vs per-region control plane. Centralized is simpler but creates a cross-region dependency for new tenant onboarding in EU and on-prem deployments. Per-region replicates the control plane and uses a global tenant catalog (DynamoDB Global Tables) as the only cross-region surface. The right answer for a regulated platform is per-region; the small extra cost is worth removing the cross-region dependency for compliance.
KMS keys per tenant vs shared. Shared keys are operationally trivial but make the "what happens when a customer asks us to delete all their data" question much harder. Per-tenant CMKs give cryptographic erasure: revoke the key, the data is gone in any meaningful sense, and you do not have to wait for backups to expire. This is the single most important design choice for compliance posture.
Local vLLM vs hosted Bedrock for self-hosted tier. If a tenant needs on-prem, vLLM is the right answer; you control the GPU pool, the model, and the inference SLO. The cost is operational: GPU drivers, NCCL tuning, CUDA upgrades, OOM debugging, all on you. For tenants who only need data sovereignty (not air-gap), Bedrock in their preferred region with their CMK is dramatically cheaper to operate than running your own vLLM pool.
Two layers. At the gateway, per-tenant token-bucket admission control caps concurrent requests by paid tier. Inside vLLM, a per-tenant priority class plumbed into the continuous-batching scheduler ensures fair share of the batch even when one tenant fills the queue. If a tenant exceeds their tier sustained for 15 min, an autoscaler spins up a dedicated pool for them and the gateway routes them there. The user-visible contract: the paid tier guarantees a minimum throughput, never a maximum.
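The gateway-side admission check is standard token-bucket arithmetic. A single-process sketch with an injectable clock (the production version would implement the same arithmetic as an atomic Redis script so the check stays one ~2 ms round trip; the class is illustrative):

```python
import time

class TokenBucket:
    """Per-tenant token bucket: refill at `rate_per_min`, cap at `burst`."""

    def __init__(self, rate_per_min: float, burst: float, now=time.monotonic):
        self.rate = rate_per_min / 60.0       # tokens per second
        self.burst = burst
        self.level = burst                    # start full
        self.now = now
        self.last = now()

    def try_consume(self, n: float = 1.0) -> bool:
        # Lazy refill: credit tokens for the time elapsed since last call.
        t = self.now()
        self.level = min(self.burst, self.level + (t - self.last) * self.rate)
        self.last = t
        if self.level >= n:
            self.level -= n
            return True
        return False                          # caller returns 429
```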
Two scenarios. If they rotate (create a new version, mark old as not-primary), AWS KMS keeps the old version available for decryption; in-flight requests succeed and the next encryption uses the new version. If they revoke the key entirely, the next decryption fails and the data plane returns 503 with a clear error. Active connections drain over the next ~30 s as cached DEKs expire. We notify the customer in the dashboard and require an explicit confirmation in the rotation flow to prevent accidental revocation.
Sell provisioned throughput units — each PTU guarantees, say, 1,000 tokens/sec sustained throughput. Map that to a fraction of a real GPU (an H100 doing Llama 3.3 70B FP8 yields ~3,000 tokens/sec, so 1 PTU = ~0.33 H100). The customer pays a flat monthly rate per PTU; overage above their PTU is throttled or billed at a higher per-token rate. This aligns with how Bedrock and Azure already sell PTUs, so customers understand the model.
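The PTU-to-GPU mapping from that paragraph, as arithmetic (the 3,000 tokens/sec/H100 figure is the assumed FP8 throughput from the capacity section):

```python
# Map guaranteed PTU throughput to a fraction of an H100.
H100_TOKENS_PER_SEC = 3_000     # Llama 3.3 70B FP8, batched (assumed)
PTU_TOKENS_PER_SEC = 1_000      # per-PTU throughput guarantee

def gpus_needed(ptus: int) -> float:
    return ptus * PTU_TOKENS_PER_SEC / H100_TOKENS_PER_SEC
```

A tenant buying 3 PTUs reserves one full H100; capacity planning for the pool is just the sum of sold PTUs divided by three, plus headroom for overage.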
Three reasons. (1) Different SLOs — the data plane must be 99.95%, the control plane only 99.9%; conflating them forces the cheaper one to meet the expensive one's bar. (2) Different change cadence — tenant config changes hourly, the data plane changes daily; deploying them together drags the data plane into the control plane's blast radius. (3) Different security boundaries — the control plane holds plaintext provider credentials and KMS unwrap operations; keeping it off the request path reduces attack surface. The "extra service" cost is real but trivial compared to a single cross-tenant leak incident.
Out of scope for the basic platform; pushed to the application layer. The gateway exposes a chat.completion endpoint that hits one provider per call; if the customer wants ensemble or cascade routing (try Haiku first, escalate to Sonnet on low confidence), they implement that in their app by calling the gateway twice. The reason: ensemble logic is application-specific, and the gateway has no business deciding what "low confidence" means for a given product.
(1) Sales hands off a signed contract with the declared region (eu-central-1) and isolation tier (database-per-tenant). (2) An onboarding API call creates a tenant row, allocates a dedicated Postgres database in eu-central-1, and requests a grant on a CMK held in the customer's EU AWS account. (3) Once the CMK grant is verified, we provision the tenant's vector index, default API keys, and quota records. (4) Provider config defaults to Bedrock in eu-central-1 with the contract's EU data-residency terms in effect. (5) The control plane pushes the new tenant's config snapshot to the eu-central-1 data plane; the tenant can make their first API call within ~25 min. (6) An audit log entry records the operator who initiated onboarding, the contract reference, and the region pinning; this is the artifact compliance auditors want.