The titles overlap, but in 2026 the two roles solve different problems with different tools. ML Engineers build and ship models. AI Engineers build systems on top of foundation models. This page lays out the divergence concretely — responsibilities, stack, evaluation, and team patterns — so you can staff (or apply for) the right role.
1. TL;DR

| Dimension | ML Engineer | AI Engineer |
| --- | --- | --- |
| Primary artifact | A trained model (.pt, .onnx, .pkl) | A prompted/chained system on top of an API |
| Core skill | Statistics, feature engineering, distributed training | Orchestration, retrieval, evals, cost and safety engineering |

2. AI Engineer Core Responsibilities

- Orchestrate tools and agents: function calling, MCP servers, multi-step workflows with state.
- Define and run LLM evals: golden sets, regression suites, LLM-as-judge with calibration.
- Manage cost: prompt caching, model routing (cheap model first, escalate on uncertainty), token budgets.
- Defend against prompt injection, jailbreaks, and PII leakage.
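The "cheap model first, escalate on uncertainty" routing pattern reduces to a small decision function. A minimal sketch: `call_model`, the model names, and the confidence threshold are all illustrative placeholders, not a real SDK surface.

```python
# Minimal sketch of cheap-first model routing (illustrative only).
# `call_model` stands in for a real SDK call; model names are placeholders.

def route(query: str, call_model, cheap: str = "cheap-model",
          strong: str = "strong-model", threshold: float = 0.7) -> str:
    """Try the cheap tier first; escalate when it reports low confidence."""
    answer, confidence = call_model(cheap, query)
    if confidence >= threshold:
        return answer                      # cheap tier was confident enough
    return call_model(strong, query)[0]    # escalate on uncertainty

# Usage with stubbed backends:
def fake_call(model: str, query: str):
    if model == "cheap-model":
        return ("maybe", 0.4)   # low confidence -> triggers escalation
    return ("definitely", 0.95)

print(route("hard question", fake_call))   # -> definitely
```

In production the confidence signal might be a logprob, a self-rated score, or a classifier; the routing logic stays this simple.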
3. Tech Stack: Overlap and Divergence
Both roles share Python, FastAPI, Docker, Kubernetes, observability (OpenTelemetry, Prometheus), and a cloud (AWS / GCP / Azure). After that they diverge.
ML Engineer Stack
```python
# Typical imports
import torch
import torch.nn as nn
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import roc_auc_score
import xgboost as xgb
import mlflow
import optuna
import ray
```
AI Engineer Stack

```python
# Typical imports
from anthropic import Anthropic
from openai import OpenAI
from sentence_transformers import SentenceTransformer
from langchain_core.runnables import RunnableLambda
from llama_index.core import VectorStoreIndex
import instructor
from pydantic import BaseModel
```
Foundation models: Anthropic Claude, OpenAI GPT, Google Gemini, open-weight (Llama, Qwen, Mistral) via vLLM or Bedrock.
Orchestration: LangChain, LlamaIndex, Haystack, raw SDK calls (often the right answer).
Structured output: Instructor, Outlines, JSON mode, Anthropic tool use as schema.
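Whichever library enforces it, the contract is the same: a schema the model's output must validate against. A minimal sketch using Pydantic — the `Invoice` schema and its fields are hypothetical, and the raw string stands in for a real API response that Instructor or JSON mode would produce:

```python
# Sketch: enforce a schema on LLM output. This is the pattern behind
# Instructor / JSON mode; the Invoice schema and fields are hypothetical.
import json

from pydantic import BaseModel

class Invoice(BaseModel):
    vendor: str
    total_usd: float

def parse_structured(raw: str):
    """Validate raw model output against the schema; None signals retry/repair."""
    try:
        return Invoice(**json.loads(raw))
    except (TypeError, ValueError):   # bad JSON or failed validation
        return None

good = parse_structured('{"vendor": "Acme", "total_usd": 41.5}')
bad = parse_structured('{"vendor": "Acme"}')   # missing field -> None
```

A `None` result typically triggers a repair loop: re-prompt the model with the validation error appended.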
4. Training vs Prompting and RAG
The defining methodological difference: ML Engineers change weights; AI Engineers change context.
ML Engineer: "The model gets the wrong answer" → collect more labeled data, adjust loss, regularize, retrain. The unit of improvement is a checkpoint.
AI Engineer: "The model gets the wrong answer" → rewrite the system prompt, add a few-shot example, improve retrieval, add a tool, switch model tier. The unit of improvement is a prompt/config diff.
The line blurs in two places:
Fine-tuning small open-weight models (LoRA, QLoRA) is increasingly an AI Engineer task because the iteration loop matches prompting more than from-scratch training.
Embedding model training (contrastive fine-tuning of a retrieval model on your domain) sits squarely with ML Engineers but is consumed by AI Engineers downstream.
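That division of labor can be made concrete. The retrieval step the AI Engineer tunes is, at its core, a similarity search over embeddings produced upstream; the vectors and document ids below are hand-made stand-ins for real embedding-model output:

```python
# Toy retrieval over pre-computed embeddings. Real vectors would come from
# an embedding model; these 3-d vectors and doc ids are illustrative.
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def top_k(query_vec, corpus, k=2):
    """Return the k document ids most similar to the query vector."""
    scored = sorted(corpus.items(),
                    key=lambda kv: cosine(query_vec, kv[1]),
                    reverse=True)
    return [doc_id for doc_id, _ in scored[:k]]

corpus = {
    "refund-policy": [0.9, 0.1, 0.0],
    "shipping-times": [0.1, 0.9, 0.1],
    "api-reference": [0.0, 0.2, 0.9],
}
print(top_k([0.8, 0.2, 0.0], corpus, k=1))   # -> ['refund-policy']
```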
5. Evaluation Methods
ML Engineer Eval
Quantitative, offline, well-established metrics. The pipeline runs nightly and produces a leaderboard.
Standard metrics: AUROC, AUPRC, RMSE, MAPE, NDCG, Recall@K, calibration error.
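In practice these come from `sklearn.metrics`, but AUROC in particular reduces to a rank statistic: the probability that a randomly chosen positive outscores a randomly chosen negative. A dependency-free sketch of that Mann-Whitney formulation:

```python
# AUROC via the Mann-Whitney formulation: the probability that a random
# positive is scored above a random negative (ties count half).
def auroc(labels, scores):
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

print(auroc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))   # -> 0.75
```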
AI Engineer Eval
Mixed: deterministic checks (does the JSON parse? does it cite a real source?) plus LLM-as-judge for open-ended quality. Goldens are small (50–500 examples) and hand-curated.
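The deterministic half costs nothing to run on every change. A minimal sketch — the `KNOWN_SOURCES` set and the response shape are assumptions about your pipeline:

```python
# Sketch of the deterministic half of an LLM eval: cheap, exact checks run
# before any LLM-as-judge call. Source ids and payload shape are hypothetical.
import json

KNOWN_SOURCES = {"doc-17", "doc-42"}

def deterministic_checks(raw_answer: str) -> dict:
    """Return named pass/fail results for schema and citation checks."""
    results = {"parses": False, "cites_real_source": False}
    try:
        payload = json.loads(raw_answer)
    except ValueError:
        return results
    results["parses"] = True
    cited = set(payload.get("sources", []))
    results["cites_real_source"] = bool(cited) and cited <= KNOWN_SOURCES
    return results

ok = deterministic_checks('{"answer": "...", "sources": ["doc-17"]}')
bad = deterministic_checks('{"answer": "...", "sources": ["doc-99"]}')
```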
```python
from anthropic import Anthropic

client = Anthropic()

JUDGE_PROMPT = """Score the assistant's answer 1-5 on factual grounding.
A 5 means every claim is supported by the provided context.
A 1 means the answer contradicts or invents facts.
Question: {q}
Context: {ctx}
Answer: {a}
Return ONLY a single integer."""

def judge(q: str, ctx: str, a: str) -> int:
    msg = client.messages.create(
        model="claude-sonnet-4-5",
        max_tokens=4,
        messages=[{"role": "user",
                   "content": JUDGE_PROMPT.format(q=q, ctx=ctx, a=a)}],
    )
    return int(msg.content[0].text.strip())
```
Best practice: calibrate the judge against human labels on a sample, then run it as a regression suite on every prompt change.
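One minimal way to run that calibration step, assuming 1-5 integer scores from both humans and the judge (which agreement thresholds to require is a product decision, not shown here):

```python
# Sketch: calibrating an LLM judge against human labels before trusting it.
# Scores are 1-5; report exact agreement and agreement within one point.
def calibration_report(human, judge):
    n = len(human)
    exact = sum(h == j for h, j in zip(human, judge)) / n
    within_one = sum(abs(h - j) <= 1 for h, j in zip(human, judge)) / n
    return {"exact": exact, "within_one": within_one}

human = [5, 4, 2, 5, 1, 3]   # labels from human raters
judge = [5, 3, 2, 4, 2, 3]   # labels from the LLM judge on the same items
print(calibration_report(human, judge))
# -> {'exact': 0.5, 'within_one': 1.0}
```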
6. Team Org Patterns
Embedded model: One AI or ML Engineer embedded in each product team. Fast iteration, but duplicated infrastructure.
Platform team: Centralized ML/AI platform group owns the gateway, vector store, eval harness, and feature store. Product teams consume via SDK. Scales well past ~5 product teams.
Hybrid: Platform team owns shared infra; embedded engineers own per-product prompts/models. This is the dominant pattern in 2026 mid-to-large orgs.
Research / applied split: Research ML trains and releases foundation or domain models; applied AI Engineers integrate them into products. Common at companies that ship their own models.
7. Which Role Should You Hire?
Hire an ML Engineer when:
You have proprietary labeled data and the value comes from learning patterns in it (fraud, recommendation, ranking, forecasting).
Your latency budget is <100ms and a foundation model API will not fit.
The problem is structured (tabular, time-series, vision) and a smaller specialized model will outperform a general LLM at 1/100th the cost.
Hire an AI Engineer when:
The product surface is natural language (chat, summarization, extraction, agentic workflows).
You can describe the task in a paragraph and a frontier model already does it ~80% well — you need someone to close the gap.
You are integrating with tools, APIs, or knowledge bases via retrieval and function calling.
Most product teams in 2026 need both. The mistake is assuming one can do the other's job: an ML Engineer who has never run an LLM eval will ship hallucinations; an AI Engineer who has never trained a model will reach for a $50k/month API call when a 100MB XGBoost model would have worked.