# AI Systems Design
Staff-level system design walk-throughs for AI/ML systems. Each page is an
end-to-end design exercise — the kind asked in a Staff or Principal
Engineer interview — covering functional and non-functional requirements,
back-of-envelope capacity math, component choices with explicit tradeoffs,
critical-path sequence walk-throughs, failure modes, and per-thousand-request
cost analysis.
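The capacity math and per-thousand-request costing these pages lean on is deliberately simple arithmetic. As a flavor of it, here is a minimal sketch; every number below (token counts, prices) is an illustrative assumption, not a figure from any of the designs:

```python
# Back-of-envelope cost per 1,000 requests for an LLM-backed endpoint.
# All constants are illustrative assumptions, not real prices.
AVG_INPUT_TOKENS = 500    # assumed prompt size per request
AVG_OUTPUT_TOKENS = 300   # assumed completion size per request
PRICE_IN_PER_M = 0.50     # assumed $ per 1M input tokens
PRICE_OUT_PER_M = 1.50    # assumed $ per 1M output tokens

cost_per_request = (
    AVG_INPUT_TOKENS * PRICE_IN_PER_M / 1_000_000
    + AVG_OUTPUT_TOKENS * PRICE_OUT_PER_M / 1_000_000
)
cost_per_1k = 1_000 * cost_per_request
print(f"${cost_per_1k:.2f} per 1k requests")
```

With these assumed numbers the endpoint costs about $0.70 per thousand requests; the design pages do the same arithmetic with their own measured token counts.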
The opinions are mine. Where I write "I would" or "we choose," I mean it: these
are the picks I would defend in a design review for a real production system,
not a survey of every option on the market.
## Designs
- **Large-Scale RAG** — serving 10M documents to 100k users with p95 < 2s. Hybrid BM25 + dense + cross-encoder rerank, pgvector vs Pinecone vs Qdrant, embedding refresh, tenant isolation, query-result caching.
- **Multi-Tenant LLM Platform** — per-customer data, model choice, cost attribution, and KMS keys. Control plane vs data plane split; row-level security vs schema-per-tenant vs separate indexes; per-tenant rate limits and provider routing.
- **Real-Time Embedding Pipeline** — re-embedding 1M documents/day with sub-minute freshness. CDC from Postgres via Debezium → Kafka → vLLM-served bge-large workers → idempotent upserts; backpressure, DLQs, cost-quality knobs.
- **Self-Hosted LLM Inference Service** — Llama 3.3 70B at 200 concurrent requests with p95 < 4s. vLLM with FP8 quantization, continuous batching, KV-cache math, GPU sizing (H100 vs L40S), autoscaling on queue depth, OpenAI-compatible gateway.
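The RAG design's hybrid retrieval has to merge a BM25 ranking and a dense ranking before the cross-encoder rerank. The pages don't prescribe a fusion method, but reciprocal-rank fusion is one common choice; a minimal sketch with hypothetical doc ids:

```python
# Reciprocal-rank fusion (RRF): merge several ranked lists by summing
# 1/(k + rank) per document. k=60 is the commonly cited default.
def rrf(rankings, k=60):
    """rankings: list of ranked doc-id lists; returns doc ids by fused score."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_top = ["d3", "d1", "d7"]   # hypothetical BM25 ranking
dense_top = ["d1", "d9", "d3"]  # hypothetical dense-retrieval ranking
print(rrf([bm25_top, dense_top]))
```

The fused list then feeds the cross-encoder, which only has to rerank the short merged candidate set rather than either full corpus.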
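The KV-cache math in the inference design reduces to a few multiplications. A minimal sketch, assuming the published Llama 3 70B shape (80 layers, 8 grouped-query KV heads, head dimension 128), 1 byte per element under FP8, and an assumed workload of 200 concurrent requests at a 4096-token average context:

```python
# KV-cache sizing sketch for a Llama-3-70B-class model under FP8.
# Model shape follows the Llama 3 70B architecture; the workload
# numbers (concurrency, context length) are assumptions.
LAYERS = 80          # transformer layers
KV_HEADS = 8         # grouped-query attention KV heads
HEAD_DIM = 128       # dimension per head
BYTES_PER_ELEM = 1   # FP8

# One K and one V vector per layer, per token
kv_bytes_per_token = 2 * LAYERS * KV_HEADS * HEAD_DIM * BYTES_PER_ELEM
print(kv_bytes_per_token)  # bytes per token across all layers

total_gib = 200 * 4096 * kv_bytes_per_token / 2**30
print(f"{total_gib:.0f} GiB of KV cache at peak")
```

Under these assumptions each token pins 160 KiB of cache and the fleet needs roughly 125 GiB for KV cache alone, which is why the design has to reason about spilling past a single 80 GB H100.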
## How to read these
Every design page follows the same eleven-section skeleton: problem statement,
SLOs, capacity math, architecture, data model, critical paths, scaling
bottlenecks, failure modes, cost analysis, tradeoffs, and a final block of six
collapsible interview Q&A pairs. If you are using these to prep for an
interview, the Q&A block is the one to read out loud.