Pipeline Diagram — How Most Python Code Actually Works
Most non-trivial Python is a pipeline: read something → transform it → write something. ML training, ETL, scrapers, web request handlers, batch jobs, RAG systems. The natural diagram is left-to-right or top-to-bottom boxes connected by arrows showing the data flowing through stages. Drawing a class diagram for a sklearn pipeline is the wrong abstraction — it's three classes, but the meaningful structure is the data flowing through them in order.
1. Linear Pipeline (pandas, sklearn, generators)
The simplest and most common shape: each stage takes one input, returns one output, hands off to the next stage. Use this for pandas chains, sklearn Pipeline, generator pipelines, and short ETL jobs.
┌──────────────────────────────────────────────────────────────────────────────────────────┐
│ Pipeline Diagram — Linear Data Transformation (pandas / sklearn) │
│ │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │
│ │ read_csv │──▶│ filter │──▶│ groupby │──▶│ to_csv │ │
│ │ (raw input) │ │ (drop nulls) │ │ (aggregate) │ │ (output) │ │
│ └──────────────┘ └──────────────┘ └──────────────┘ └──────────────┘ │
│ │
│ Each stage is a function that takes a DataFrame and returns a DataFrame. │
│ The pipeline IS the data flow; module structure (one .py file per stage) │
│ matters less than the STAGE NAMES and TYPES flowing between them. │
│ │
│ Equivalent code: │
│ │
│ df = (pd.read_csv("input.csv") │
│ .pipe(filter_nulls) │
│ .pipe(group_and_aggregate) │
│ .pipe(write_to_csv, path="output.csv")) │
└──────────────────────────────────────────────────────────────────────────────────────────┘
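The same shape works without pandas. A minimal generator version — a sketch, with hypothetical column names ("amount", "region") standing in for the stages above:

import csv

# Each stage takes an iterator and returns an iterator; the pipeline is lazy.
def read_rows(path):
    with open(path, newline="") as f:
        yield from csv.DictReader(f)

def filter_nulls(rows):
    # hypothetical rule: drop rows with an empty "amount" field
    return (r for r in rows if r.get("amount"))

def aggregate(rows):
    # rows ↓ to one per group, like the groupby box above
    totals = {}
    for r in rows:
        totals[r["region"]] = totals.get(r["region"], 0.0) + float(r["amount"])
    yield from totals.items()

def write_csv(rows, path):
    with open(path, "w", newline="") as f:
        csv.writer(f).writerows(rows)

# Wiring mirrors the boxes: read ──▶ filter ──▶ aggregate ──▶ write
write_csv(aggregate(filter_nulls(read_rows("input.csv"))), "output.csv")

Nothing runs until write_csv consumes the chain, so the pipeline stays lazy and roughly constant-memory.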
What to show on each box
- Stage name — the function being called (verb-form: filter, tokenize, predict)
- Output type / shape — what the next stage receives. For pandas: the (rows × cols) shape. For sklearn: X.shape. For generators: the yielded type.
- Cardinality changes — does the stage filter (rows ↓), expand (rows ↑), or aggregate (rows ↓ to groups)? Annotate when significant; a shape-logging sketch follows this list.
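One cheap way to keep those annotations honest is to log shapes between stages while you draw. A sketch — log_shape is a hypothetical helper and "region" a hypothetical column, not pandas API:

import pandas as pd

def log_shape(df: pd.DataFrame, stage: str) -> pd.DataFrame:
    # Print the (rows × cols) annotation you'd put on the diagram arrow.
    print(f"{stage}: {df.shape[0]} rows × {df.shape[1]} cols")
    return df

df = (pd.read_csv("input.csv")
        .pipe(log_shape, "read_csv")
        .dropna()
        .pipe(log_shape, "filter")      # rows ↓ if nulls were dropped
        .groupby("region", as_index=False).sum(numeric_only=True)
        .pipe(log_shape, "groupby"))    # rows ↓ to one row per group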
sklearn equivalent
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression

pipe = Pipeline([
    ("scale", StandardScaler()),    # X.shape unchanged
    ("pca", PCA(n_components=10)),  # X.shape (n, 10)
    ("clf", LogisticRegression()),  # predict_proba returns (n, n_classes)
])
pipe.fit(X_train, y_train)
The diagram for this is the same shape: 3 boxes labeled scale, pca, clf with arrows between them. The code uses class instances; the diagram shows the data flow through them.
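scikit-learn can draw this picture itself: in a notebook, after set_config(display="diagram"), a Pipeline renders as exactly these boxes.

from sklearn import set_config

set_config(display="diagram")  # Pipeline objects now render as box diagrams
pipe                           # in a notebook cell: scale ──▶ pca ──▶ clf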
2. Branching DAG (Airflow, Prefect, Dagster)
When stages can run in parallel — fan-out for independent transformations, fan-in to combine results — a linear pipeline becomes a directed acyclic graph (DAG). This is the model behind Airflow, Prefect, Dagster, Luigi, and Snowflake's task graphs.
┌────────────────────────────────────────────────────────────────────────────┐
│ Pipeline DAG — Branching with Fan-out / Fan-in (Airflow / Prefect)         │
│                                                                            │
│                            ┌──────────────────┐                            │
│                            │  extract_orders  │                            │
│                            └────────┬─────────┘                            │
│                                     │                                      │
│                  ┌──────────────────┼──────────────────┐                   │
│                  ▼                  ▼                  ▼                   │
│         ┌──────────────────┐ ┌──────────────┐ ┌──────────────────┐        │
│         │ enrich_with_user │ │ tag_currency │ │ score_fraud_risk │        │
│         └────────┬─────────┘ └──────┬───────┘ └────────┬─────────┘        │
│                  │                  │                  │                   │
│                  └──────────────────┼──────────────────┘                   │
│                                     ▼                                      │
│                            ┌──────────────────┐                            │
│                            │  load_warehouse  │                            │
│                            └────────┬─────────┘                            │
│                                     │                                      │
│                                     ▼                                      │
│                            ┌──────────────────┐                            │
│                            │   notify_slack   │  (on success or failure)   │
│                            └──────────────────┘                            │
│                                                                            │
│ Each node is a @task function. Edges are explicit: enrich >> load.         │
│ Fan-out (3 parallel enrichment tasks) and fan-in (single load_warehouse)   │
│ are the two key patterns Airflow / Prefect / Dagster optimize for.         │
└────────────────────────────────────────────────────────────────────────────┘
Equivalent Airflow / Prefect code
# Airflow 2.x with TaskFlow API
from airflow.decorators import dag, task
from datetime import datetime

@dag(start_date=datetime(2026, 1, 1), schedule="@daily")
def order_enrichment():
    @task
    def extract_orders(): ...

    @task
    def enrich_with_user(orders): ...

    @task
    def tag_currency(orders): ...

    @task
    def score_fraud_risk(orders): ...

    @task
    def load_warehouse(*enriched): ...

    @task
    def notify_slack(load_result): ...

    orders = extract_orders()
    enriched = [enrich_with_user(orders),
                tag_currency(orders),
                score_fraud_risk(orders)]
    loaded = load_warehouse(*enriched)
    notify_slack(loaded)

dag = order_enrichment()
The diagram is the DAG — Airflow's UI shows almost exactly this picture, generated from the code dependencies between @task calls.
Prefect equivalent
from prefect import flow, task

@task
def extract_orders(): ...

@task
def enrich_with_user(orders): ...

# ... etc

@flow
def order_enrichment():
    orders = extract_orders()
    fan_out = [enrich_with_user(orders),
               tag_currency(orders),
               score_fraud_risk(orders)]
    loaded = load_warehouse(fan_out)
    notify_slack(loaded)
3. Multi-Stage ML / RAG Pipeline
When the pipeline has many stages (8+) and each stage has a meaningful sub-label (model name, library, parameter), a vertical layout reads better than horizontal. Useful for ML training, RAG ingestion, transformer fine-tuning, agent loops.
┌────────────────────────────────────────────────────────────────────────────┐
│ Pipeline Diagram — RAG / ML Multi-Stage (LLM Application)                  │
│                                                                            │
│       ┌────────────────────────────────────────────────────────────┐      │
│       │          PDFs / Word / S3 Bucket (raw documents)           │      │
│       └──────────────────────────────┬─────────────────────────────┘      │
│                                      │                                     │
│                                      ▼                                     │
│       ┌────────────────────────────────────────────────────────────┐      │
│       │            Document Loader (PyMuPDF, Textract)             │      │
│       └──────────────────────────────┬─────────────────────────────┘      │
│                                      │                                     │
│                                      ▼                                     │
│       ┌────────────────────────────────────────────────────────────┐      │
│       │               Chunker (semantic, 512 tokens)               │      │
│       └──────────────────────────────┬─────────────────────────────┘      │
│                                      │                                     │
│                                      ▼                                     │
│       ┌────────────────────────────────────────────────────────────┐      │
│       │               Embedder (bge-large / OpenAI)                │      │
│       └──────────────────────────────┬─────────────────────────────┘      │
│                                      │                                     │
│                                      ▼                                     │
│       ┌────────────────────────────────────────────────────────────┐      │
│       │              Vector Store (pgvector / Qdrant)              │      │
│       └──────────────────────────────┬─────────────────────────────┘      │
│                                      │                                     │
│                                      ▼                                     │
│       ┌────────────────────────────────────────────────────────────┐      │
│       │           Hybrid Retriever (BM25 + dense + RRF)            │      │
│       └──────────────────────────────┬─────────────────────────────┘      │
│                                      │                                     │
│                                      ▼                                     │
│       ┌────────────────────────────────────────────────────────────┐      │
│       │        Cross-Encoder Reranker (bge-reranker-large)         │      │
│       └──────────────────────────────┬─────────────────────────────┘      │
│                                      │                                     │
│                                      ▼                                     │
│       ┌────────────────────────────────────────────────────────────┐      │
│       │     LLM (Claude / GPT-5) (grounded answer + citations)     │      │
│       └────────────────────────────────────────────────────────────┘      │
│                                                                            │
│ Same pattern works for sklearn.Pipeline, transformer training loops,       │
│ data engineering ETL — wherever data flows through transformations.        │
└────────────────────────────────────────────────────────────────────────────┘
Why this beats a class diagram for the same code
The code behind this RAG pipeline might be 10 small classes (PDFLoader, SemanticChunker, BGEEmbedder, etc.) — but a class diagram tells you nothing about how they're wired together. The pipeline diagram tells you exactly (and the sketch after this list shows one way to write that wiring down):
- The order operations happen
- What each stage produces (the small text on the right of each box)
- Where you'd swap a component (the labels are model/library names you can change)
- Where the bottleneck is likely to be (typically embedding or rerank)
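A minimal sketch of that wiring made explicit. The stage classes are the hypothetical names above; the Stage protocol is generic glue, not any framework's API:

from dataclasses import dataclass
from typing import Any, Protocol

class Stage(Protocol):
    def __call__(self, data: Any) -> Any: ...

@dataclass
class RagIngest:
    # The order of this list IS the diagram; swap a component by replacing one entry.
    stages: list[Stage]

    def run(self, data: Any) -> Any:
        for stage in self.stages:
            data = stage(data)   # each box: one input in, one output out
        return data

# Hypothetical wiring matching the box labels above:
# pipeline = RagIngest([PDFLoader(), SemanticChunker(max_tokens=512),
#                       BGEEmbedder(), PgVectorWriter()])
# pipeline.run("s3://bucket/raw-docs/")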
4. Box Labels, Stage Types, and Annotations
| Stage type | Convention |
|---|---|
| Source | Cylinder shape, or rectangle labeled "(raw input)" / "(database)" |
| Pure transformation | Rectangle, function name in present tense (filter, tokenize) |
| Side effect (write) | Rectangle labeled "(output)" or with destination type in subscript |
| External call | Dashed border, label includes the service ("OpenAI API", "S3") |
| Branch / fan-out | Single arrow splits into N arrows; arrowhead per branch |
| Join / fan-in | N arrows converge; downstream stage takes a list/tuple |
| Optional / conditional | Box with a guard label like "[only if needs_review]" |
| Loop / retry | Wrap stage in a fragment box labeled "loop until success" |
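These conventions map directly onto Graphviz attributes if you generate diagrams from code. A sketch with the graphviz Python package (node names illustrative; assumes the Graphviz binaries are installed):

from graphviz import Digraph

g = Digraph("pipeline", graph_attr={"rankdir": "LR"})  # left-to-right flow
g.node("src", "orders.csv\n(raw input)", shape="cylinder")           # source
g.node("filter", "filter")                                           # pure transform
g.node("score", "score_fraud_risk\n(OpenAI API)", style="dashed")    # external call
g.node("sink", "warehouse\n(output)", shape="cylinder")              # side effect
g.edge("src", "filter", label="DataFrame")     # annotate what flows on the arrow
g.edge("filter", "score", label="DataFrame")
g.edge("score", "sink", label="DataFrame")
g.render("pipeline", format="svg")             # writes pipeline.svg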
5. Common Pitfalls When Drawing Pipelines
- Hiding parallelism. If two stages run in parallel, draw them side-by-side with separate arrows — not as one big "process" box. Concurrency is the most important property to surface (see the sketch after this list).
- Mixing levels of abstraction. Don't have one box labeled read_csv and another labeled train_full_ml_pipeline. Pick a consistent zoom level; decompose the big box into its own diagram if you need detail.
- Omitting the data type / shape. The most useful annotation on a pipeline arrow is what flows across it — DataFrame[user_id, age], list[Document], np.ndarray (n, 768). Without it, the diagram is just a flowchart.
- Drawing every helper function. If filter_nulls calls a helper is_valid_row, don't put is_valid_row on the diagram. The pipeline diagram is for the conceptual stages; details belong in the code.
- Forgetting error paths. If a stage can fail and trigger a different downstream path, show it. Otherwise the diagram lies about the system.
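The parallelism sketch promised above: fan-out written so the concurrency is visible in both the code and the diagram (stage bodies elided):

import asyncio

# Fan-out: three independent enrichments run concurrently — three
# side-by-side boxes on the diagram, not one "enrich" box.
async def enrich_with_user(orders): ...
async def tag_currency(orders): ...
async def score_fraud_risk(orders): ...

async def enrich_all(orders):
    return await asyncio.gather(
        enrich_with_user(orders),
        tag_currency(orders),
        score_fraud_risk(orders),
    )

results = asyncio.run(enrich_all(orders=[]))   # three results, in branch order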
Common Interview Questions:
When is a pipeline diagram better than a sequence diagram?
Pipeline diagrams show data transformation; sequence diagrams show messaging between participants. Use a pipeline when one process moves data through stages (sklearn fit/predict, RAG ingest). Use a sequence diagram when multiple actors/services exchange messages (browser → API → DB → cache).
What's the difference between an Airflow DAG and a sklearn Pipeline?
Airflow DAGs run independent tasks across processes/machines on a schedule, with retries, alerting, and a persistent execution history. sklearn Pipelines run synchronously in one Python process, fitting/transforming data in memory. The diagram shape is the same (DAG); the runtime is completely different. Don't confuse them in an architecture review.
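A sketch of what that runtime difference looks like in code — retries and delays are per-task declarations in Airflow, with no sklearn equivalent (parameter values illustrative):

from datetime import timedelta
from airflow.decorators import task

@task(retries=3, retry_delay=timedelta(minutes=5))
def load_warehouse(*enriched):
    ...  # each attempt is a fresh process, recorded in Airflow's run history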
How do I show backpressure / streaming in a pipeline diagram?
Add a buffer/queue node between producer and consumer stages — show it as a cylinder or annotated rectangle ("Kafka topic", "asyncio.Queue", "channel"). Label it with the buffer's bounding behavior ("max 1000 items, drops oldest" or "blocks producer when full").
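A minimal in-process sketch with asyncio.Queue — the bounded queue is the node you would draw between the two stages:

import asyncio

async def produce(q: asyncio.Queue):
    for item in range(10_000):
        await q.put(item)        # blocks when the queue is full — backpressure

async def consume(q: asyncio.Queue):
    while True:
        item = await q.get()
        ...                      # slow transform here
        q.task_done()

async def main():
    q = asyncio.Queue(maxsize=1000)        # the buffer node on the diagram
    worker = asyncio.create_task(consume(q))
    await produce(q)
    await q.join()               # wait until every item has been processed
    worker.cancel()

asyncio.run(main())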
Should I draw a pipeline diagram for a simple ETL with three pandas operations?
Probably not. If the whole job fits in 10 lines and the chain is obvious from the code, skip the diagram. Diagrams pay off when the pipeline has 5+ stages, parallelism, branching, or external dependencies — and when more than one person needs to reason about it.
How do I keep a pipeline diagram in sync with the code?
For Airflow/Prefect/Dagster, use the auto-generated UI — that IS the diagram, derived from code. For pandas/sklearn, generate a diagram from the Pipeline.steps attribute or annotate stages with docstrings and use sklearn.utils.estimator_html_repr. For everything else, treat the diagram as documentation and update it on architectural changes (not on every refactor).
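A sketch of the Pipeline.steps approach, rebuilding the pipeline from section 1:

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression

pipe = Pipeline([("scale", StandardScaler()),
                 ("pca", PCA(n_components=10)),
                 ("clf", LogisticRegression())])

# Pipeline.steps is a list of (name, estimator) pairs in execution order —
# exactly the boxes and arrows of the diagram.
print(" ──▶ ".join(f"[{name}]" for name, _ in pipe.steps))
# [scale] ──▶ [pca] ──▶ [clf]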
What's a "DAG anti-pattern" in pipelines?
Common ones: (a) tasks that read/write the same external state without coordination (race conditions), (b) tasks with hidden dependencies via shared files (the dependency exists but isn't in the DAG), (c) one task doing too much (no way to retry just the failed step), (d) cycles that the framework rejects but the developer keeps trying to add.
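A sketch of anti-pattern (b), the one that bites most often — the shared file makes the dependency real, but the DAG doesn't know (task and path names hypothetical):

from airflow.decorators import dag, task
from datetime import datetime

@dag(start_date=datetime(2026, 1, 1), schedule="@daily")
def hidden_dependency():
    @task
    def write_staging():
        ...  # writes s3://bucket/staging/orders.parquet

    @task
    def read_staging():
        ...  # reads the same file — a real dependency the scheduler can't see

    # BAD: no edge, so the scheduler may run read_staging first or in parallel.
    write_staging()
    read_staging()
    # GOOD: make it explicit, e.g.  write_staging() >> read_staging()

dag = hidden_dependency()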