# AI Systems Design
Staff-level system design walk-throughs for AI/ML systems. Each page is an
end-to-end design exercise — the kind asked in a Staff or Principal
Engineer interview — covering functional and non-functional requirements,
back-of-envelope capacity math, component choices with explicit tradeoffs,
critical-path sequence walk-throughs, failure modes, and per-thousand-request
cost analysis.
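The capacity math and per-thousand-request costing these pages lean on is deliberately simple arithmetic. As a flavor of it, here is a minimal sketch; every number below (token counts, prices) is an illustrative assumption, not a figure from any of the designs:

```python
# Back-of-envelope cost per 1,000 requests for an LLM-backed endpoint.
# All constants are illustrative assumptions, not real prices.
AVG_INPUT_TOKENS = 500    # assumed prompt size per request
AVG_OUTPUT_TOKENS = 300   # assumed completion size per request
PRICE_IN_PER_M = 0.50     # assumed $ per 1M input tokens
PRICE_OUT_PER_M = 1.50    # assumed $ per 1M output tokens

cost_per_request = (
    AVG_INPUT_TOKENS * PRICE_IN_PER_M / 1_000_000
    + AVG_OUTPUT_TOKENS * PRICE_OUT_PER_M / 1_000_000
)
cost_per_1k = 1_000 * cost_per_request
print(f"${cost_per_1k:.2f} per 1k requests")
```

With these assumed numbers the endpoint costs about $0.70 per thousand requests; the design pages do the same arithmetic with their own measured token counts.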
The opinions are mine. Where I write "I would" or "we choose," I mean it: these
are the picks I would defend in a design review for a real production system,
not a survey of every option on the market.
## Designs
- **Large-Scale RAG** — serving 10M documents to 100k users with p95 < 2s. Hybrid BM25 + dense + cross-encoder rerank, pgvector vs Pinecone vs Qdrant, embedding refresh, tenant isolation, query-result caching.
- **Multi-Tenant LLM Platform** — per-customer data, model choice, cost attribution, and KMS keys. Control plane vs data plane split; row-level security vs schema-per-tenant vs separate indexes; per-tenant rate limits and provider routing.
- **Real-Time Embedding Pipeline** — re-embedding 1M documents/day with sub-minute freshness. CDC from Postgres via Debezium → Kafka → vLLM-served bge-large workers → idempotent upserts; backpressure, DLQs, cost-quality knobs.
- **Self-Hosted LLM Inference Service** — Llama 3.3 70B at 200 concurrent requests with p95 < 4s. vLLM with FP8 quantization, continuous batching, KV-cache math, GPU sizing (H100 vs L40S), autoscaling on queue depth, OpenAI-compatible gateway.
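The RAG design's hybrid retrieval has to merge a BM25 ranking and a dense ranking before the cross-encoder rerank. The pages don't prescribe a fusion method, but reciprocal-rank fusion is one common choice; a minimal sketch with hypothetical doc ids:

```python
# Reciprocal-rank fusion (RRF): merge several ranked lists by summing
# 1/(k + rank) per document. k=60 is the commonly cited default.
def rrf(rankings, k=60):
    """rankings: list of ranked doc-id lists; returns doc ids by fused score."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_top = ["d3", "d1", "d7"]   # hypothetical BM25 ranking
dense_top = ["d1", "d9", "d3"]  # hypothetical dense-retrieval ranking
print(rrf([bm25_top, dense_top]))
```

The fused list then feeds the cross-encoder, which only has to rerank the short merged candidate set rather than either full corpus.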
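The KV-cache math in the inference design reduces to a few multiplications. A minimal sketch, assuming the published Llama 3 70B shape (80 layers, 8 grouped-query KV heads, head dimension 128), 1 byte per element under FP8, and an assumed workload of 200 concurrent requests at a 4096-token average context:

```python
# KV-cache sizing sketch for a Llama-3-70B-class model under FP8.
# Model shape follows the Llama 3 70B architecture; the workload
# numbers (concurrency, context length) are assumptions.
LAYERS = 80          # transformer layers
KV_HEADS = 8         # grouped-query attention KV heads
HEAD_DIM = 128       # dimension per head
BYTES_PER_ELEM = 1   # FP8

# One K and one V vector per layer, per token
kv_bytes_per_token = 2 * LAYERS * KV_HEADS * HEAD_DIM * BYTES_PER_ELEM
print(kv_bytes_per_token)  # bytes per token across all layers

total_gib = 200 * 4096 * kv_bytes_per_token / 2**30
print(f"{total_gib:.0f} GiB of KV cache at peak")
```

Under these assumptions each token pins 160 KiB of cache and the fleet needs roughly 125 GiB for KV cache alone, which is why the design has to reason about spilling past a single 80 GB H100.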
## How to read these
Every design page follows the same eleven-section skeleton: problem statement,
SLOs, capacity math, architecture, data model, critical paths, scaling
bottlenecks, failure modes, cost analysis, tradeoffs, and a final block of six
collapsible interview Q&A pairs. If you are using these to prep for an
interview, the Q&A block is the one to read out loud.