The brief: build a SaaS platform where each customer can bring their own data, choose their model provider (Bedrock, Azure OpenAI, or self-hosted vLLM), and receive isolated cost and usage reporting. Customer A is a US healthcare provider that requires HIPAA-eligible Bedrock in us-east-1. Customer B is a German bank that requires Azure OpenAI in EU regions with customer-managed keys. Customer C is a defense contractor that requires fully on-prem inference. Same product, three deployment realities.
Multi-tenancy looks like a feature checkbox until you have to debug a noisy-neighbor issue at 2 AM, or explain to a customer why their bill jumped 4×, or convince a security auditor that tenant A's keys cannot decrypt tenant B's data even with root on the database host. This design is opinionated about all three.
| Metric | Target |
|---|---|
| Routing overhead added to LLM call | < 25 ms p95 |
| Quota check latency | < 5 ms p95 |
| Tenant onboarding time | < 30 min, fully automated |
| Cost report freshness | < 5 min from request |
| Cross-tenant data leak rate | Zero, enforced at multiple layers |
| Availability (data plane) | 99.95% per region |
| Availability (control plane) | 99.9% (it can be down briefly without taking inference down) |
Tenants and traffic. Plan for 500 paying tenants, average 10 RPS each at peak, top-1% tenants at 200 RPS. Aggregate ~5,000 sustained RPS, ~15,000 peak. Average request: 2,000 input tokens + 500 output tokens.
Token volume. 5,000 RPS × 2,500 tokens/req ≈ 12.5M tokens/sec aggregate ≈ 1 trillion tokens/day. Of that, ≈ 70% routed to managed providers (Bedrock/Azure), ≈ 30% to self-hosted vLLM.
Per-tenant data. Average 100k documents at 4 KB ≈ 400 MB text per tenant, 4 GB after embeddings. 500 tenants ≈ 2 TB total at steady state; allow 10 TB headroom for data-heavy tenants.
Self-hosted GPU pool. 30% of 12.5M tokens/sec = 3.75M tokens/sec on vLLM. At Llama 3.1 70B FP8 throughput of ~3,000 tokens/sec/H100 batched, that needs ~1,250 H100s. That is implausible budget-wise; the design assumption is that self-hosted is an opt-in tier reserved for ~5% of traffic. Right-size for ~200 H100s aggregate, or roughly $3.5M/year of GPU lease at ~$2/hr per H100.
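The envelope math above fits in a few lines; every constant here is an assumption stated in this section, not a measurement:

```python
# Back-of-envelope capacity math. All constants are the assumptions
# stated above (request mix, H100 throughput, lease rate).
RPS_SUSTAINED = 5_000
TOKENS_PER_REQ = 2_000 + 500            # input + output tokens
H100_TOKENS_PER_SEC = 3_000             # Llama 70B FP8, batched (assumed)
H100_LEASE_USD_PER_HR = 2.0

agg_tokens_per_sec = RPS_SUSTAINED * TOKENS_PER_REQ          # 12.5M
tokens_per_day = agg_tokens_per_sec * 86_400                 # ~1.08e12

self_hosted_share = 0.05                                     # opt-in tier
gpus = agg_tokens_per_sec * self_hosted_share / H100_TOKENS_PER_SEC
annual_lease_usd = gpus * H100_LEASE_USD_PER_HR * 24 * 365   # ~$3.65M
```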
The single most important design choice in a multi-tenant platform is the split between control plane (slow, transactional, can briefly be down) and data plane (fast, hot path, must never be down).
+======================== CONTROL PLANE ========================+
| + tenant onboarding API + KMS key provisioning |
| + quota / billing config + provider credential storage |
| + audit / compliance UI + cost reporting aggregation |
| Backed by: Postgres (tenants), Vault (secrets), S3 (logs) |
| SLO: 99.9% (down for 5 min = no new tenants, inference fine) |
+================================================================+
|
policy + creds pushed via
versioned snapshots (S3 + ETag)
v
+========================= DATA PLANE ==========================+
| + API gateway + auth + rate limit |
| + provider router (Bedrock / Azure / vLLM) |
| + per-tenant vector index access |
| + token-counting + per-request cost emission |
| Backed by: per-region Redis, Postgres replicas, Kafka |
| SLO: 99.95% (every 1 min down = real customer impact) |
+================================================================+
The data plane caches everything it needs from the control plane: tenant config, quota state, provider credentials, model routing rules. If the control plane is down, inference continues using the last cached snapshot. New tenant onboarding pauses; existing tenants are untouched.
Provider credentials are pushed via short-lived signed snapshots from the control plane to each data-plane region; the data plane never queries Vault on the hot path.
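The snapshot refresh on the data-plane side amounts to a conditional fetch plus an atomic swap. A minimal sketch, with the fetch injected so the logic is testable (in the real system this would be an S3 `GET` with `If-None-Match` returning 304 when the ETag is unchanged; the class and names are illustrative, not the platform's actual API):

```python
from typing import Callable, Optional, Tuple

class SnapshotCache:
    """Holds the last-known control-plane snapshot for this region."""

    def __init__(self, fetch: Callable[[Optional[str]], Tuple[Optional[dict], str]]):
        # fetch(etag) returns (snapshot, new_etag); snapshot is None
        # when the ETag is unchanged (the HTTP 304 case).
        self._fetch = fetch
        self._etag: Optional[str] = None
        self._config: dict = {}

    def refresh(self) -> bool:
        """Poll the control plane; returns True if a new snapshot landed."""
        snapshot, etag = self._fetch(self._etag)
        if snapshot is None:
            return False            # not modified: keep serving the cache
        self._config, self._etag = snapshot, etag
        return True

    def get(self, tenant_id: str) -> Optional[dict]:
        # Hot path: in-memory lookup only, never a network call. If the
        # control plane is down, this keeps serving the last snapshot.
        return self._config.get(tenant_id)
```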
Three isolation models exist; pick deliberately based on the tenant's risk profile, not as a one-size-fits-all default.
| Model | Isolation | Cost | When to use |
|---|---|---|---|
| Shared schema + RLS | App-layer + DB row-level security | Cheapest | SMB tier, no regulatory ask |
| Schema-per-tenant | DB-level, separate ACLs | Moderate | Enterprise; auditable boundary |
| Database-per-tenant | Physical, separate KMS key | Highest | Regulated (HIPAA, ITAR, EU sovereignty) |
The default is schema-per-tenant; promote to database-per-tenant on contractual requirement. Below is the schema-per-tenant skeleton:
-- Control-plane catalog
CREATE TABLE tenants (
tenant_id UUID PRIMARY KEY,
name TEXT NOT NULL,
region TEXT NOT NULL, -- us-east-1, eu-central-1, on-prem-dc1
isolation_model TEXT NOT NULL, -- 'rls', 'schema', 'database'
schema_name TEXT, -- e.g. 'tenant_a1b2c3'
kms_key_arn TEXT NOT NULL, -- customer-managed key
provider_config JSONB NOT NULL, -- {bedrock: {...}, azure: {...}, vllm: {...}}
quota_rpm INT NOT NULL,
quota_tpm BIGINT NOT NULL,
monthly_budget_usd NUMERIC NOT NULL,
created_at TIMESTAMPTZ DEFAULT now()
);
-- Per-tenant schema (created on onboarding, identical structure)
CREATE SCHEMA tenant_a1b2c3;
SET search_path TO tenant_a1b2c3;
CREATE TABLE documents (...);
CREATE TABLE chunks (...);
CREATE TABLE api_keys (...);
CREATE TABLE usage (request_id UUID, ts TIMESTAMPTZ, model TEXT,
input_tokens INT, output_tokens INT, cost_usd NUMERIC);
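One detail the skeleton above glosses over: schema names cannot be bound as SQL parameters, so the onboarding code has to interpolate them into DDL. A safe sketch derives the name from the tenant UUID's hex (never from customer-supplied text) and validates it before interpolation; the function names here are illustrative, not the platform's actual onboarding code:

```python
import re
import uuid

# Only names we generated ourselves can match this.
SCHEMA_RE = re.compile(r"^tenant_[0-9a-f]{12}$")

def schema_name(tenant_id: uuid.UUID) -> str:
    # Derived from the UUID, never from a customer-supplied string.
    return f"tenant_{tenant_id.hex[:12]}"

def onboarding_ddl(tenant_id: uuid.UUID) -> str:
    name = schema_name(tenant_id)
    if not SCHEMA_RE.match(name):          # defense in depth before DDL
        raise ValueError(f"unsafe schema name: {name}")
    return "\n".join([
        f"CREATE SCHEMA {name};",
        f"SET search_path TO {name};",
        # per-tenant tables elided, as in the skeleton above
    ])
```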
Encryption. Every per-tenant Postgres schema has its tablespace encrypted with the tenant's customer-managed KMS key (CMK) via envelope encryption. The platform holds only the wrapped data encryption key (DEK); revoking the CMK in the customer's AWS account makes the entire schema unreadable, instantly. This is the design feature compliance teams care most about and is hard to bolt on later.
Vector indexes. A per-tenant chunks table inside the tenant schema means each tenant's HNSW index is physically separate. No cross-tenant query is possible by construction; there is no RLS predicate to forget.
import time

from fastapi import HTTPException

async def chat_completion(req: ChatRequest, api_key: str) -> ChatResponse:
    # 1. Auth + tenant lookup (cached, ~1 ms)
    tenant = await tenant_cache.lookup_by_key(api_key)
    if tenant is None:
        raise HTTPException(401)

    # 2. Quota check (Redis token bucket, ~2 ms)
    if not await quota.try_consume(
        tenant.id, rpm=1, tpm=estimate_tokens(req)
    ):
        raise HTTPException(429, headers=quota.retry_after(tenant.id))

    # 3. Region pin: refuse if the request hit the wrong region
    if tenant.region != CURRENT_REGION:
        raise HTTPException(421, detail=f"redirect to {tenant.region}")

    # 4. Provider routing (~1 ms, in-memory)
    provider = router.pick(tenant, req.model)
    # provider is one of: BedrockProvider, AzureProvider, VLLMProvider

    # 5. Decrypt provider creds with tenant CMK (cached short-lived, ~3 ms first call)
    creds = await creds_cache.get(tenant.id, provider.name)

    # 6. Forward the call (the 99% of latency)
    start = time.monotonic_ns()
    resp = await provider.complete(req, creds)
    latency_ns = time.monotonic_ns() - start

    # 7. Emit usage event to Kafka (fire-and-forget, non-blocking)
    usage_emitter.emit(UsageEvent(
        tenant_id=tenant.id,
        api_key=api_key,
        model=resp.model,
        input_tokens=resp.usage.input_tokens,
        output_tokens=resp.usage.output_tokens,
        cost_usd=price.compute(provider, resp.usage),
        latency_ns=latency_ns,
    ))
    return resp
The routing rules live in tenant.provider_config as JSONB:
provider_config:
default_model: claude-3-5-sonnet
routes:
- model_pattern: "claude-*"
provider: bedrock
region: us-east-1
role_arn: arn:aws:iam::123456789012:role/tenant-a-bedrock
- model_pattern: "gpt-4*"
provider: azure
endpoint: https://tenant-a.openai.azure.com
deployment: gpt-4o
- model_pattern: "llama-3.3-70b"
provider: vllm
endpoint: https://vllm-pool-1.internal:8000
fallback:
when: provider_5xx
use: bedrock/claude-3-5-haiku
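Resolving a model name against those routes is a first-match glob scan. A minimal sketch of what `router.pick` could look like, using `fnmatch` for the `model_pattern` globs (the caller applies the `fallback` entry on provider 5xx; function names are illustrative):

```python
from fnmatch import fnmatch

def pick_route(provider_config: dict, model: str) -> dict:
    """Return the first route whose model_pattern glob matches `model`."""
    for route in provider_config["routes"]:
        if fnmatch(model, route["model_pattern"]):
            return route
    raise LookupError(f"no route for model {model!r}")
```

First-match semantics keep the config predictable: a tenant who wants a specific model pinned to a specific provider lists that route above the broader glob.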
Idempotency. Every chat request requires an idempotency key header. The router uses it as a Redis key with 24 h TTL pointing at the prior response; replays return the cached response without re-billing the customer or re-calling the provider.
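The check needs to be race-safe: two concurrent requests with the same idempotency key must not both reach the provider. A sketch of the pattern, assuming a SET-NX reservation plus a stored response under the same key (`redis` here is any client exposing async `set`/`get`; the names and sentinel are illustrative, not the platform's actual code):

```python
IDEMPOTENCY_TTL_S = 24 * 3600
PENDING = b"__pending__"        # sentinel: first attempt still in flight

async def idempotent_call(redis, idem_key: str, call):
    # SET NX atomically reserves the key; only one caller wins the race.
    reserved = await redis.set(idem_key, PENDING, nx=True, ex=IDEMPOTENCY_TTL_S)
    if not reserved:
        cached = await redis.get(idem_key)
        if cached is not None and cached != PENDING:
            # Replay: return the prior response, no re-bill, no provider call.
            return cached
        # First attempt still running (or the key just expired): retryable.
        raise RuntimeError("request in flight; retry later")
    resp = await call()                          # the actual provider call
    await redis.set(idem_key, resp, ex=IDEMPOTENCY_TTL_S)
    return resp
```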
Per 1M tokens routed (averaged across providers):
provider_pass_through: $3.00 # weighted avg of Bedrock/Azure/vLLM
platform_overhead:
api_gateway: $0.10
routing_compute: $0.05
quota_redis: $0.02
usage_pipeline: $0.05
storage_postgres: $0.08
audit_logs_s3: $0.03
---
cost_to_us: $3.33
list_price_to_customer: $4.50 # ~35% markup on cost (~26% gross margin)
Self-hosted vLLM tier is fixed-cost per H100 (~$2/hr lease) regardless of volume; the marginal token cost is essentially zero, so the platform charges a flat per-PTU monthly fee plus a small overage rate.
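The unit economics above check out arithmetically: the overhead items sum to $0.33 per 1M tokens, and $4.50 over a $3.33 cost is a ~35% markup on cost (equivalently a ~26% gross margin):

```python
# Reproduce the per-1M-token cost table above.
provider_pass_through = 3.00
overhead_items = [0.10, 0.05, 0.02, 0.05, 0.08, 0.03]   # gateway..audit

cost_to_us = provider_pass_through + sum(overhead_items)  # ~$3.33
list_price = 4.50
markup_on_cost = (list_price - cost_to_us) / cost_to_us   # ~0.35
gross_margin = (list_price - cost_to_us) / list_price     # ~0.26
```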
Cost attribution flow:
request --> data plane --> emit UsageEvent (Kafka)
|
+-------------------+-------------------+
| |
v v
Redis incr by tenant_id ClickHouse insert (sharded)
(live dashboard, 1s lag) (billing source of truth)
| |
v v
/usage/live API endpoint nightly invoice generator
reconciliation vs Stripe
Two pipelines, two consistency guarantees: live dashboards prioritize speed, billing prioritizes accuracy. They will diverge by < 1% in normal operation; billing always wins in disputes.
Schema-per-tenant vs RLS vs database-per-tenant. RLS is fine for SMB and removes operational complexity, but a single missed SET app.tenant_id in a query path becomes a cross-tenant leak. Schema-per-tenant moves isolation to the database boundary: a wrong search_path returns "relation does not exist" rather than leaking data, which is a much easier failure mode to detect. Database-per-tenant is required for HIPAA/ITAR/EU-sovereignty tenants and gives the strongest CMK story but multiplies operational cost. We use all three; the tier is set per tenant, not per platform.
Per-tenant model routing in the gateway vs in the application. Doing it in the gateway means every tenant gets the same routing logic and cost emission for free, and applications stay simple. Doing it in the application gives finer control (e.g. retry to a cheaper model on long inputs) but duplicates logic across teams. The platform should provide both: a default gateway route and an opt-in "I'll route myself" mode for sophisticated customers.
Centralized vs per-region control plane. Centralized is simpler but creates a cross-region dependency for new tenant onboarding in EU and on-prem deployments. Per-region replicates the control plane and uses a global tenant catalog (DynamoDB Global Tables) as the only cross-region surface. The right answer for a regulated platform is per-region; the small extra cost is worth removing the cross-region dependency for compliance.
KMS keys per tenant vs shared. Shared keys are operationally trivial but make the "what happens when a customer asks us to delete all their data" question much harder. Per-tenant CMKs give cryptographic erasure: revoke the key, the data is gone in any meaningful sense, and you do not have to wait for backups to expire. This is the single most important design choice for compliance posture.
Local vLLM vs hosted Bedrock for self-hosted tier. If a tenant needs on-prem, vLLM is the right answer; you control the GPU pool, the model, and the inference SLO. The cost is operational: GPU drivers, NCCL tuning, CUDA upgrades, OOM debugging, all on you. For tenants who only need data sovereignty (not air-gap), Bedrock in their preferred region with their CMK is dramatically cheaper to operate than running your own vLLM pool.
Two layers. At the gateway, per-tenant token-bucket admission control caps concurrent requests by paid tier. Inside vLLM, a per-tenant priority class plumbed into the continuous-batching scheduler ensures fair share of the batch even when one tenant fills the queue. If a tenant exceeds their tier sustained for 15 min, an autoscaler spins up a dedicated pool for them and the gateway routes them there. The user-visible contract: the paid tier guarantees a minimum throughput, never a maximum.
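The gateway-side admission check is standard token-bucket arithmetic. A single-process sketch with an injectable clock (the production version would implement the same arithmetic as an atomic Redis script so the check stays one ~2 ms round trip; the class is illustrative):

```python
import time

class TokenBucket:
    """Per-tenant token bucket: refill at `rate_per_min`, cap at `burst`."""

    def __init__(self, rate_per_min: float, burst: float, now=time.monotonic):
        self.rate = rate_per_min / 60.0       # tokens per second
        self.burst = burst
        self.level = burst                    # start full
        self.now = now
        self.last = now()

    def try_consume(self, n: float = 1.0) -> bool:
        # Lazy refill: credit tokens for the time elapsed since last call.
        t = self.now()
        self.level = min(self.burst, self.level + (t - self.last) * self.rate)
        self.last = t
        if self.level >= n:
            self.level -= n
            return True
        return False                          # caller returns 429
```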
Two scenarios. If they rotate (create a new version, mark old as not-primary), AWS KMS keeps the old version available for decryption; in-flight requests succeed and the next encryption uses the new version. If they revoke the key entirely, the next decryption fails and the data plane returns 503 with a clear error. Active connections drain over the next ~30 s as cached DEKs expire. We notify the customer in the dashboard and require an explicit confirmation in the rotation flow to prevent accidental revocation.
Sell provisioned throughput units — each PTU guarantees, say, 1,000 tokens/sec sustained throughput. Map that to a fraction of a real GPU (an H100 doing Llama 3.3 70B FP8 yields ~3,000 tokens/sec, so 1 PTU = ~0.33 H100). The customer pays a flat monthly rate per PTU; overage above their PTU is throttled or billed at a higher per-token rate. This aligns with how Bedrock and Azure already sell PTUs, so customers understand the model.
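The PTU-to-GPU mapping from that paragraph, as arithmetic (the 3,000 tokens/sec/H100 figure is the assumed FP8 throughput from the capacity section):

```python
# Map guaranteed PTU throughput to a fraction of an H100.
H100_TOKENS_PER_SEC = 3_000     # Llama 3.3 70B FP8, batched (assumed)
PTU_TOKENS_PER_SEC = 1_000      # per-PTU throughput guarantee

def gpus_needed(ptus: int) -> float:
    return ptus * PTU_TOKENS_PER_SEC / H100_TOKENS_PER_SEC
```

A tenant buying 3 PTUs reserves one full H100; capacity planning for the pool is just the sum of sold PTUs divided by three, plus headroom for overage.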
Three reasons. (1) Different SLOs — the data plane must be 99.95%, the control plane only 99.9%; conflating them forces the cheaper one to meet the expensive one's bar. (2) Different change cadence — tenant config changes hourly, the data plane changes daily; deploying them together drags the data plane into the control plane's blast radius. (3) Different security boundaries — the control plane holds plaintext provider credentials and KMS unwrap operations; keeping it off the request path reduces attack surface. The "extra service" cost is real but trivial compared to a single cross-tenant leak incident.
Out of scope for the basic platform; pushed to the application layer. The gateway exposes a chat.completion endpoint that hits one provider per call; if the customer wants ensemble or cascade routing (try Haiku first, escalate to Sonnet on low confidence), they implement that in their app by calling the gateway twice. The reason: ensemble logic is application-specific, and the gateway has no business deciding what "low confidence" means for a given product.
(1) Sales hands off a signed contract with the declared region (eu-central-1) and isolation tier (database-per-tenant). (2) An onboarding API call creates a tenant row, allocates a dedicated Postgres database in eu-central-1, and requests a grant on a CMK held in the customer's EU AWS account. (3) Once the CMK grant is verified, we provision the tenant's vector index, default API keys, and quota records. (4) Provider config defaults to Bedrock in eu-central-1 with the contract's EU data-residency terms in effect. (5) The control plane pushes the new tenant's config snapshot to the eu-central-1 data plane; the tenant can make their first API call within ~25 min. (6) An audit log entry records the operator who initiated onboarding, the contract reference, and the region pinning; this is the artifact compliance auditors want.