AWS Step Functions is a serverless orchestration service that coordinates AWS services into durable workflows. You define a state machine in Amazon States Language (ASL), a JSON format that describes tasks, choices, parallel branches, retries, and waits, and Step Functions executes it with built-in error handling and observability. Standard workflows run for up to one year with exactly-once execution; Express workflows handle high-throughput runs of up to five minutes at lower cost. Distributed Map adds high-fan-out parallel processing across thousands to millions of items.
State machine in Amazon States Language with retries, a parallel branch, and a wait-for-callback step:
{
  "Comment": "Order processing pipeline",
  "StartAt": "ValidateOrder",
  "States": {
    "ValidateOrder": {
      "Type": "Task",
      "Resource": "arn:aws:states:::lambda:invoke",
      "Parameters": {
        "FunctionName": "validate-order",
        "Payload.$": "$"
      },
      "Retry": [{
        "ErrorEquals": ["States.TaskFailed"],
        "IntervalSeconds": 2,
        "MaxAttempts": 3,
        "BackoffRate": 2.0
      }],
      "Next": "ChargeAndShip"
    },
    "ChargeAndShip": {
      "Type": "Parallel",
      "Branches": [
        {
          "StartAt": "Charge",
          "States": {
            "Charge": {
              "Type": "Task",
              "Resource": "arn:aws:states:::lambda:invoke.waitForTaskToken",
              "Parameters": {
                "FunctionName": "charge-card",
                "Payload": {"taskToken.$": "$$.Task.Token", "input.$": "$"}
              },
              "End": true
            }
          }
        },
        {
          "StartAt": "Ship",
          "States": {
            "Ship": {
              "Type": "Task",
              "Resource": "arn:aws:states:::lambda:invoke",
              "Parameters": {"FunctionName": "ship-package", "Payload.$": "$"},
              "End": true
            }
          }
        }
      ],
      "End": true
    }
  }
}
Start an execution from boto3:
import json
import boto3

# Client for the region where the state machine is deployed.
sfn = boto3.client("stepfunctions", region_name="us-west-2")

# The execution input must be a JSON string, not a dict.
resp = sfn.start_execution(
    stateMachineArn="arn:aws:states:us-west-2:123456789012:stateMachine:OrderPipeline",
    input=json.dumps({"order_id": "A-9001", "user_id": 42}),
)
print(resp["executionArn"])
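Use Standard workflows for long-running, durable, exactly-once work such as orchestration, approvals, and ETL. Use Express workflows for high-volume, short-lived event processing such as per-request orchestration in APIs or IoT event handling; they run for at most five minutes. Express is billed by request count and duration rather than per state transition, which typically makes it much cheaper at high volume. Synchronous Express executions are at-most-once, and asynchronous Express executions are at-least-once.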
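To make the synchronous Express mode concrete, here is a minimal sketch of a helper that runs one execution and returns its parsed output. The helper name `run_express_sync` is hypothetical; it assumes `sfn` behaves like `boto3.client("stepfunctions")` and that the state machine was created with type EXPRESS.

```python
import json


def run_express_sync(sfn, state_machine_arn, payload):
    """Run a Synchronous Express execution and return its parsed output.

    Assumption: `sfn` behaves like boto3.client("stepfunctions").
    """
    resp = sfn.start_sync_execution(
        stateMachineArn=state_machine_arn,
        input=json.dumps(payload),
    )
    # Unlike start_execution, the synchronous call blocks and reports
    # the final status of the run in the response.
    if resp["status"] != "SUCCEEDED":
        raise RuntimeError(
            f"{resp['status']}: {resp.get('error')} {resp.get('cause')}"
        )
    return json.loads(resp["output"])
```

Because synchronous Express runs are at-most-once, a caller that retries after a failure is starting a new execution, so downstream tasks should tolerate repeated requests.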
Distributed Map fans out parallel child executions over a dataset (an S3 inventory, a JSON array, or a CSV file), with up to 10,000 concurrent child executions and millions of items. It replaces ad-hoc Lambda fan-out patterns and avoids the 25,000-event history limit of a single execution, because items are processed in child executions that each keep their own history.
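As an illustration, the following sketch builds a Distributed Map state as a Python dict and serializes it to ASL. The bucket, key, state names, and the `process-item` Lambda function are placeholders, and the concurrency value is an arbitrary choice rather than a recommendation.

```python
import json


def distributed_map_state(bucket, key, max_concurrency=1000):
    """Build a Map state that reads rows from a CSV object in S3."""
    return {
        "Type": "Map",
        # ItemReader tells the Map state where the dataset lives.
        "ItemReader": {
            "Resource": "arn:aws:states:::s3:getObject",
            "ReaderConfig": {"InputType": "CSV", "CSVHeaderLocation": "FIRST_ROW"},
            "Parameters": {"Bucket": bucket, "Key": key},
        },
        # Mode DISTRIBUTED runs items in child executions.
        "ItemProcessor": {
            "ProcessorConfig": {"Mode": "DISTRIBUTED", "ExecutionType": "EXPRESS"},
            "StartAt": "ProcessItem",
            "States": {
                "ProcessItem": {
                    "Type": "Task",
                    "Resource": "arn:aws:states:::lambda:invoke",
                    # "process-item" is a hypothetical function name.
                    "Parameters": {"FunctionName": "process-item", "Payload.$": "$"},
                    "End": True,
                }
            },
        },
        "MaxConcurrency": max_concurrency,
        "End": True,
    }


definition = {
    "StartAt": "ProcessAll",
    "States": {"ProcessAll": distributed_map_state("example-bucket", "orders.csv")},
}
print(json.dumps(definition, indent=2))
```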
To pause a workflow for a human approval or another external callback, use the .waitForTaskToken integration pattern. The task hands a token to a Lambda or notification step; the workflow pauses until SendTaskSuccess or SendTaskFailure is called with that token (e.g., from an approval link in email).
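The callback side of this pattern can be sketched as follows. The function name `resolve_approval`, the `ApprovalRejected` error name, and the output shape are assumptions made for illustration; `sfn` is assumed to behave like `boto3.client("stepfunctions")`.

```python
import json


def resolve_approval(sfn, task_token, approved, reviewer):
    """Resume a workflow that is paused on a .waitForTaskToken task."""
    if approved:
        # The output becomes the result of the paused Task state.
        sfn.send_task_success(
            taskToken=task_token,
            output=json.dumps({"approved": True, "reviewer": reviewer}),
        )
    else:
        # The error name can be matched by Retry or Catch in the state machine.
        sfn.send_task_failure(
            taskToken=task_token,
            error="ApprovalRejected",  # hypothetical error name
            cause=f"Rejected by {reviewer}",
        )
```

A typical caller would be the handler behind the approval link, which reads the token from the request and passes it through unchanged.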
Task, Parallel, and Map states can declare a Retry array (error filters, interval, backoff rate, maximum attempts) and a Catch array (error filters and a Next state to transition to). This moves most retry plumbing out of individual Lambda functions and centralizes error policy in the state machine.
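To show how the retry fields interact, this sketch defines a Task state with both Retry and Catch and computes the wait before each retry. The `NotifyFailure` state and the `retry_delays` helper are hypothetical, and the calculation ignores optional settings such as MaxDelaySeconds and jitter.

```python
def retry_delays(interval_seconds, max_attempts, backoff_rate):
    """Seconds waited before each retry: the interval grows by backoff_rate."""
    return [interval_seconds * backoff_rate ** n for n in range(max_attempts)]


validate_order = {
    "Type": "Task",
    "Resource": "arn:aws:states:::lambda:invoke",
    "Parameters": {"FunctionName": "validate-order", "Payload.$": "$"},
    "Retry": [{
        "ErrorEquals": ["States.TaskFailed"],
        "IntervalSeconds": 2,
        "MaxAttempts": 3,
        "BackoffRate": 2.0,
    }],
    # Once retries are exhausted, route the error to a handler state.
    "Catch": [{
        "ErrorEquals": ["States.ALL"],
        "ResultPath": "$.error",
        "Next": "NotifyFailure",  # hypothetical state name
    }],
    "Next": "ChargeAndShip",
}

policy = validate_order["Retry"][0]
delays = retry_delays(
    policy["IntervalSeconds"], policy["MaxAttempts"], policy["BackoffRate"]
)
print(delays)  # [2.0, 4.0, 8.0]
```

With these settings the task is attempted up to four times in total: the initial attempt plus three retries, waiting 2, 4, and 8 seconds respectively.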
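Optimized integrations (e.g., arn:aws:states:::lambda:invoke) are tailored per service and, depending on the service, support patterns such as .sync (run a job and wait for it to complete) and .waitForTaskToken. SDK integrations (e.g., arn:aws:states:::aws-sdk:dynamodb:getItem) call AWS API actions directly across a much wider set of services; they trade the service-specific conveniences, including .sync, for broader coverage.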
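The difference is mostly visible in the Resource ARN and the parameter shape. The sketch below contrasts the two styles as Python dicts; the job, queue, table, and state names are placeholders, and AWS Batch is used only as one example of a service with a .sync integration.

```python
import json

# Optimized integration: the .sync suffix makes the state wait for the job.
run_batch_job = {
    "Type": "Task",
    "Resource": "arn:aws:states:::batch:submitJob.sync",
    "Parameters": {
        "JobName": "nightly-report",        # placeholder
        "JobDefinition": "report-job-def",  # placeholder
        "JobQueue": "default-queue",        # placeholder
    },
    "Next": "LoadOrder",
}

# SDK integration: parameters mirror the DynamoDB GetItem API request.
load_order = {
    "Type": "Task",
    "Resource": "arn:aws:states:::aws-sdk:dynamodb:getItem",
    "Parameters": {
        "TableName": "Orders",  # placeholder
        "Key": {"order_id": {"S.$": "$.order_id"}},
    },
    "End": True,
}

definition = {
    "StartAt": "RunBatchJob",
    "States": {"RunBatchJob": run_batch_job, "LoadOrder": load_order},
}
print(json.dumps(definition, indent=2))
```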
Input and output passed between states are capped at 256 KiB. For larger payloads, store the data in S3 and pass a reference between states (the "claim check" pattern).
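A minimal sketch of the claim check, assuming `s3` behaves like `boto3.client("s3")`. The helper name `claim_check` and the `s3_bucket` / `s3_key` reference fields are illustrative conventions, not a Step Functions feature.

```python
import json

MAX_STATE_PAYLOAD_BYTES = 256 * 1024  # payload limit between states


def claim_check(s3, bucket, key, payload, limit=MAX_STATE_PAYLOAD_BYTES):
    """Return the payload itself if it fits, else an S3 reference to it."""
    body = json.dumps(payload).encode("utf-8")
    if len(body) <= limit:
        return payload
    # Too large for a state transition: park it in S3 and pass a pointer.
    s3.put_object(Bucket=bucket, Key=key, Body=body)
    return {"s3_bucket": bucket, "s3_key": key}
```

The consuming task then checks for the reference fields and fetches the object from S3 before processing. In practice it is worth leaving some headroom below the limit, since downstream states may add fields to the payload.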
Step Functions is the go-to AWS orchestrator for workflows that need more structure than chained Lambdas. With Distributed Map, Express workflows, and SDK integrations, it covers everything from per-request orchestration to long-running, fan-out batch processing.