AI Engineer vs ML Engineer

The titles overlap, but in 2026 the two roles solve different problems with different tools. ML Engineers build and ship models. AI Engineers build systems on top of foundation models. This page lays out the divergence concretely — responsibilities, stack, evaluation, and team patterns — so you can staff (or apply for) the right role.


1. TL;DR

| Dimension | ML Engineer | AI Engineer |
|---|---|---|
| Primary artifact | A trained model (.pt, .onnx, .pkl) | A prompted/chained system on top of an API |
| Core skill | Statistics, feature engineering, distributed training | Prompting, retrieval, agent orchestration, eval design |
| Compute concern | GPUs for training; latency and throughput at inference | Token cost, context window, rate limits, cache hit rate |
| Failure mode | Model drift, data leakage | Hallucination, prompt injection, tool-use loops |
| Iteration loop | Hours to weeks (training runs) | Minutes (prompt edits) to days (eval cycles) |

2. Typical Responsibilities

ML Engineer

  - Own data pipelines and feature engineering for training sets
  - Train, tune, and validate models, from gradient-boosted trees to deep networks
  - Serve models against latency and throughput targets; version and roll back
  - Monitor for model drift and data leakage; schedule retraining

AI Engineer

  - Design prompts, tool schemas, and agent orchestration on top of foundation-model APIs
  - Build retrieval pipelines: chunking, embeddings, vector search
  - Write eval suites: deterministic checks plus LLM-as-judge regression tests
  - Manage token cost, context windows, rate limits, and defenses against prompt injection


3. Tech Stack: Overlap and Divergence

Both roles share Python, FastAPI, Docker, Kubernetes, observability (OpenTelemetry, Prometheus), and a cloud (AWS / GCP / Azure). After that they diverge.

ML Engineer Stack

# Typical imports
import torch                                          # tensors + autograd
import torch.nn as nn                                 # model definition
from sklearn.model_selection import StratifiedKFold   # cross-validation
from sklearn.metrics import roc_auc_score             # offline metrics
import xgboost as xgb                                 # gradient-boosted trees for tabular data
import mlflow                                         # experiment tracking
import optuna                                         # hyperparameter search
import ray                                            # distributed training / tuning

AI Engineer Stack

# Typical imports
from anthropic import Anthropic                        # LLM API client
from openai import OpenAI                              # LLM API client
from sentence_transformers import SentenceTransformer  # embeddings for retrieval
from langchain_core.runnables import RunnableLambda    # chain composition
from llama_index.core import VectorStoreIndex          # RAG indexing
import instructor                                      # structured LLM outputs
from pydantic import BaseModel                         # output schemas

4. Training vs Prompting and RAG

The defining methodological difference: ML Engineers change weights; AI Engineers change context.

The line blurs in two places:

  1. Fine-tuning small open-weight models (LoRA, QLoRA) is increasingly an AI Engineer task because the iteration loop matches prompting more than from-scratch training.
  2. Embedding model training (contrastive fine-tuning of a retrieval model on your domain) sits squarely with ML Engineers but is consumed by AI Engineers downstream.
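
The "change context" half of that contrast fits in a few lines: instead of updating weights, the loop retrieves text and rewrites the prompt around it. A toy sketch, assuming a trivial word-overlap retriever (a real system would use embedding search, as in the stack above):

```python
def retrieve(query: str, docs: list[str], k: int = 2) -> list[str]:
    """Toy retriever: rank docs by word overlap with the query."""
    q_words = set(query.lower().split())
    ranked = sorted(docs,
                    key=lambda d: len(q_words & set(d.lower().split())),
                    reverse=True)
    return ranked[:k]

def build_prompt(query: str, docs: list[str]) -> str:
    """'Changing context': you iterate on this template, not on weights."""
    context = "\n".join(f"- {d}" for d in retrieve(query, docs))
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"

docs = [
    "The refund window is 30 days from purchase.",
    "Shipping takes 3-5 business days.",
    "Refunds are issued to the original payment method.",
]
prompt = build_prompt("How do refunds work?", docs)
```

Every iteration here costs seconds, which is exactly why the iteration-loop row in the TL;DR table diverges so sharply.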

5. Evaluation Methods

ML Engineer Eval

Quantitative, offline, well-established metrics. The pipeline runs nightly and produces a leaderboard.

from sklearn.metrics import (
    roc_auc_score, precision_recall_curve, average_precision_score,
    confusion_matrix, ndcg_score
)

# Toy labels and scores so the snippet runs standalone
y_true = [0, 0, 1, 1]
y_pred_proba = [0.1, 0.4, 0.35, 0.8]

# Classification
auc = roc_auc_score(y_true, y_pred_proba)
ap  = average_precision_score(y_true, y_pred_proba)

# Ranking: graded relevance labels vs predicted scores
y_true_rel = [3, 2, 0, 1]
y_pred_scores = [2.5, 0.1, 0.2, 1.0]
ndcg = ndcg_score([y_true_rel], [y_pred_scores], k=10)

Standard metrics: AUROC, AUPRC, RMSE, MAPE, NDCG, Recall@K, calibration error.
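
Of those, calibration error is the one most often hand-rolled. A minimal sketch of expected calibration error (ECE) with equal-width confidence bins (the bin count and binning scheme are illustrative choices, not a standard):

```python
def expected_calibration_error(y_true, y_prob, n_bins=10):
    """ECE: size-weighted gap between mean predicted probability and
    observed positive rate, summed over equal-width confidence bins."""
    n = len(y_true)
    ece = 0.0
    for b in range(n_bins):
        lo, hi = b / n_bins, (b + 1) / n_bins
        # bin membership is (lo, hi], with the first bin closed at 0
        idx = [i for i, p in enumerate(y_prob)
               if (p > lo or b == 0) and p <= hi]
        if not idx:
            continue
        conf = sum(y_prob[i] for i in idx) / len(idx)  # mean confidence
        acc = sum(y_true[i] for i in idx) / len(idx)   # empirical positive rate
        ece += len(idx) / n * abs(acc - conf)
    return ece

# A perfectly calibrated toy model: predicts 0.5 on a 50/50 split
print(expected_calibration_error([0, 1, 0, 1], [0.5, 0.5, 0.5, 0.5]))  # → 0.0
```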

AI Engineer Eval

Mixed: deterministic checks (does the JSON parse? does it cite a real source?) plus LLM-as-judge for open-ended quality. Goldens are small (50–500 examples) and hand-curated.
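
The deterministic half is cheap enough to run on every output. A sketch of the two checks named above, assuming the model is asked for JSON with a `source` field and that `KNOWN_SOURCES` holds your corpus's document IDs (both names are illustrative):

```python
import json

KNOWN_SOURCES = {"doc-001", "doc-002", "doc-003"}  # illustrative corpus IDs

def check_output(raw: str) -> dict:
    """Deterministic checks: does the JSON parse, and does it cite a real source?"""
    result = {"parses": False, "cites_real_source": False}
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError:
        return result
    result["parses"] = True
    result["cites_real_source"] = obj.get("source") in KNOWN_SOURCES
    return result

print(check_output('{"answer": "30 days", "source": "doc-002"}'))
# → {'parses': True, 'cites_real_source': True}
```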

from anthropic import Anthropic

client = Anthropic()

JUDGE_PROMPT = """Score the assistant's answer 1-5 on factual grounding.
A 5 means every claim is supported by the provided context.
A 1 means the answer contradicts or invents facts.

Question: {q}
Context: {ctx}
Answer:   {a}

Return ONLY a single integer."""

def judge(q: str, ctx: str, a: str) -> int:
    msg = client.messages.create(
        model="claude-sonnet-4-5",
        max_tokens=4,
        messages=[{"role": "user",
                   "content": JUDGE_PROMPT.format(q=q, ctx=ctx, a=a)}],
    )
    return int(msg.content[0].text.strip())

Best practice: calibrate the judge against human labels on a sample, then run it as a regression suite on every prompt change.
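
Calibration can start as simple agreement rates on the labeled sample; a sketch, assuming you have paired judge and human scores on the same 1-5 scale:

```python
def judge_agreement(judge_scores: list[int], human_scores: list[int]) -> dict:
    """Exact and within-one agreement between judge and human labels.
    If within-one agreement is low, fix the rubric before trusting the judge."""
    pairs = list(zip(judge_scores, human_scores))
    exact = sum(j == h for j, h in pairs) / len(pairs)
    within_one = sum(abs(j - h) <= 1 for j, h in pairs) / len(pairs)
    return {"exact": exact, "within_one": within_one}

print(judge_agreement([5, 4, 2, 1], [5, 3, 2, 3]))
# → {'exact': 0.5, 'within_one': 0.75}
```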


6. Team Org Patterns

Three patterns recur:

  1. Embedded — one AI or ML Engineer per product squad. Fastest iteration, weakest shared tooling.
  2. Platform — a central team owns eval harnesses, model serving, and retrieval infrastructure; product teams build on top.
  3. Hybrid — a small platform team plus embedded engineers. The common end state once more than two or three teams ship model-backed features.


7. Which Role Should You Hire?

Hire an ML Engineer when:

  - The problem is prediction over proprietary structured data: ranking, fraud, forecasting, recommendations.
  - Latency or unit-cost budgets rule out a per-request LLM call.
  - You need custom models trained, monitored for drift, and retrained on a schedule.

Hire an AI Engineer when:

  - The product is language-heavy: summarization, extraction, Q&A, agents over your tools and documents.
  - A foundation-model API already gets you most of the way, and iteration speed matters more than marginal accuracy.
  - You need retrieval pipelines, prompt and eval infrastructure, and guardrails against hallucination and prompt injection.

Most product teams in 2026 need both. The mistake is assuming one can do the other's job: an ML Engineer who has never run an LLM eval will ship hallucinations; an AI Engineer who has never trained a model will reach for a $50k/month API call when a 100MB XGBoost model would have worked.