AWS Step Functions

AWS Step Functions is a serverless orchestration service that coordinates AWS services into durable workflows. You define a state machine in Amazon States Language (ASL), a JSON format describing tasks, choices, parallel branches, retries, and waits, and Step Functions executes it with built-in error handling and observability. Standard workflows run for up to one year with exactly-once execution; Express workflows handle high-throughput, short-lived runs (up to five minutes) at lower cost. Distributed Map adds high-fan-out parallel processing across datasets of up to millions of items.


Key Features:


Common Use Cases:


Service Limits & Quotas:


Pricing Model:


Code Example:

State machine in Amazon States Language with retries, a parallel branch, and a wait-for-callback step:


{
  "Comment": "Order processing pipeline",
  "StartAt": "ValidateOrder",
  "States": {
    "ValidateOrder": {
      "Type": "Task",
      "Resource": "arn:aws:states:::lambda:invoke",
      "Parameters": {
        "FunctionName": "validate-order",
        "Payload.$": "$"
      },
      "Retry": [{
        "ErrorEquals": ["States.TaskFailed"],
        "IntervalSeconds": 2,
        "MaxAttempts": 3,
        "BackoffRate": 2.0
      }],
      "Next": "ChargeAndShip"
    },
    "ChargeAndShip": {
      "Type": "Parallel",
      "Branches": [
        {
          "StartAt": "Charge",
          "States": {
            "Charge": {
              "Type": "Task",
              "Resource": "arn:aws:states:::lambda:invoke.waitForTaskToken",
              "Parameters": {
                "FunctionName": "charge-card",
                "Payload": {"taskToken.$": "$$.Task.Token", "input.$": "$"}
              },
              "End": true
            }
          }
        },
        {
          "StartAt": "Ship",
          "States": {
            "Ship": {
              "Type": "Task",
              "Resource": "arn:aws:states:::lambda:invoke",
              "Parameters": {"FunctionName": "ship-package", "Payload.$": "$"},
              "End": true
            }
          }
        }
      ],
      "End": true
    }
  }
}
  

Start an execution from boto3:


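import json

import boto3

# Region and ARN are placeholders; substitute your own account's values.
sfn = boto3.client("stepfunctions", region_name="us-west-2")

# The execution input must be a JSON string, not a dict.
resp = sfn.start_execution(
    stateMachineArn="arn:aws:states:us-west-2:123456789012:stateMachine:OrderPipeline",
    input=json.dumps({"order_id": "A-9001", "user_id": 42}),
)
# start_execution returns immediately; the ARN identifies the run for later status checks.
print(resp["executionArn"])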
  


Common Interview Questions:

When should you use Standard vs Express workflows?

Standard for long-running, durable, exactly-once workflows (orchestration, approvals, ETL). Express for high-volume, short-lived event processing (per-request orchestration in APIs, IoT events). Express is dramatically cheaper at scale and supports both at-most-once (sync) and at-least-once (async) modes.
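
The synchronous Express mode can be sketched as follows. This is a minimal illustration, not a reference implementation: the helper name run_express_sync is hypothetical, and sfn is assumed to be a boto3 Step Functions client whose state machine was created with type EXPRESS.

```python
import json


def run_express_sync(sfn, state_machine_arn, payload):
    """Run a Synchronous Express workflow and return its parsed output.

    `sfn` is assumed to be a boto3 "stepfunctions" client (or any object
    exposing start_sync_execution with the same keyword arguments).
    """
    resp = sfn.start_sync_execution(
        stateMachineArn=state_machine_arn,
        input=json.dumps(payload),
    )
    # Unlike start_execution, the sync call blocks and reports the outcome
    # (SUCCEEDED, FAILED, or TIMED_OUT) directly in the response.
    if resp["status"] != "SUCCEEDED":
        raise RuntimeError(f"{resp.get('error')}: {resp.get('cause')}")
    return json.loads(resp["output"])
```

In practice you would pass boto3.client("stepfunctions") as sfn. Because sync runs are at-most-once, the caller is responsible for deciding whether to retry on failure.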

What is Distributed Map and what problem does it solve?

Distributed Map fans out parallel child executions over a dataset (S3 inventory, JSON array, CSV) with up to 10,000 concurrent children and millions of items. It replaces ad-hoc Lambda fan-out patterns and sidesteps the 25,000-event execution history limit by running items (or batches of items) in child executions, each with its own event history.
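
A Distributed Map state might look roughly like the following, built here as a Python dict and printed as ASL. The bucket, key, and function names are placeholders, and the helper distributed_map_state is hypothetical; treat it as a sketch of the state's shape rather than a complete workflow.

```python
import json


def distributed_map_state(bucket, key, worker_function, max_concurrency=1000):
    """Build an ASL Map state that reads a CSV from S3 in distributed mode."""
    return {
        "Type": "Map",
        # ItemReader tells Step Functions where the dataset lives.
        "ItemReader": {
            "Resource": "arn:aws:states:::s3:getObject",
            "ReaderConfig": {"InputType": "CSV", "CSVHeaderLocation": "FIRST_ROW"},
            "Parameters": {"Bucket": bucket, "Key": key},
        },
        # Mode DISTRIBUTED runs the processor as separate child executions.
        "ItemProcessor": {
            "ProcessorConfig": {"Mode": "DISTRIBUTED", "ExecutionType": "EXPRESS"},
            "StartAt": "ProcessItem",
            "States": {
                "ProcessItem": {
                    "Type": "Task",
                    "Resource": "arn:aws:states:::lambda:invoke",
                    "Parameters": {"FunctionName": worker_function, "Payload.$": "$"},
                    "End": True,
                }
            },
        },
        "MaxConcurrency": max_concurrency,
        "End": True,
    }


# Placeholder names for illustration only.
print(json.dumps(distributed_map_state("my-bucket", "orders.csv", "process-order"), indent=2))
```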

How do you implement human approval steps?

Use the .waitForTaskToken integration pattern. The task hands a token to a Lambda or notification step; the workflow pauses until SendTaskSuccess or SendTaskFailure is called with that token (e.g., from an approval link in email).
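
The callback side of this pattern could be sketched like this. The function name complete_approval, the "ApprovalRejected" error code, and the output shape are assumptions for illustration; sfn is assumed to be a boto3 Step Functions client, and the token is whatever the paused task passed out (for example, embedded in an approval link).

```python
import json


def complete_approval(sfn, task_token, approved, reviewer):
    """Resume a workflow paused on .waitForTaskToken.

    `sfn` is assumed to be a boto3 "stepfunctions" client.
    Returns "success" or "failure" to indicate which API was called.
    """
    if approved:
        # The output becomes the result of the paused Task state.
        sfn.send_task_success(
            taskToken=task_token,
            output=json.dumps({"approved": True, "reviewer": reviewer}),
        )
        return "success"
    # A failure surfaces in the workflow as an error that Retry/Catch can match.
    sfn.send_task_failure(
        taskToken=task_token,
        error="ApprovalRejected",  # hypothetical error code
        cause=f"Rejected by {reviewer}",
    )
    return "failure"
```

A handler behind the approval link would call this with the token from the request; setting a timeout on the waiting state is worth considering so an ignored email does not stall the run indefinitely.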

How do you handle errors and retries declaratively?

Task, Parallel, and Map states can each declare a Retry array (with backoff, max attempts, error filters) and a Catch array (error filters and a Next state). This eliminates Lambda-level retry plumbing and centralizes error policy in the state machine.
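
To make the retry arithmetic concrete, the sketch below computes the wait before each retry for the settings used in the ValidateOrder state above (IntervalSeconds 2, MaxAttempts 3, BackoffRate 2.0) and shows a state that adds a Catch fallback. The NotifyFailure state name is hypothetical, and the calculation ignores optional settings such as a maximum delay or jitter.

```python
def retry_delays(interval_seconds, max_attempts, backoff_rate):
    """Seconds waited before each retry: interval * backoff_rate ** n."""
    return [interval_seconds * backoff_rate ** n for n in range(max_attempts)]


# Task state with both Retry and Catch, expressed as a Python dict.
validate_order = {
    "Type": "Task",
    "Resource": "arn:aws:states:::lambda:invoke",
    "Parameters": {"FunctionName": "validate-order", "Payload.$": "$"},
    "Retry": [{
        "ErrorEquals": ["States.TaskFailed"],
        "IntervalSeconds": 2,
        "MaxAttempts": 3,
        "BackoffRate": 2.0,
    }],
    # Once retries are exhausted, route to a fallback state and keep the
    # error details alongside the original input under $.error.
    "Catch": [{
        "ErrorEquals": ["States.ALL"],
        "ResultPath": "$.error",
        "Next": "NotifyFailure",  # hypothetical state name
    }],
    "Next": "ChargeAndShip",
}

policy = validate_order["Retry"][0]
print(retry_delays(policy["IntervalSeconds"], policy["MaxAttempts"], policy["BackoffRate"]))
# [2.0, 4.0, 8.0]
```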

What's the difference between optimized integrations and SDK integrations?

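Optimized integrations (e.g., arn:aws:states:::lambda:invoke) are tailored per service: they simplify request and response handling, and depending on the service they support .waitForTaskToken and the .sync pattern (wait for a job to complete, as with AWS Batch or ECS tasks). SDK integrations (e.g., arn:aws:states:::aws-sdk:dynamodb:getItem) call almost any AWS API action directly; they support request-response and .waitForTaskToken but not .sync, trading features for broader coverage.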
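
Side by side, the two integration styles differ mainly in the Resource ARN and in how Parameters are shaped. The sketch below uses Python dicts for two Task states; the function name, table name, and key attribute are placeholders chosen for illustration.

```python
import json

# Optimized integration: Step Functions knows the Lambda service and wraps
# the call; Parameters follow the integration's own shape.
optimized_task = {
    "Type": "Task",
    "Resource": "arn:aws:states:::lambda:invoke",
    "Parameters": {"FunctionName": "validate-order", "Payload.$": "$"},
    "End": True,
}

# SDK integration: Parameters mirror the underlying API request
# (here DynamoDB GetItem), with PascalCase field names.
sdk_task = {
    "Type": "Task",
    "Resource": "arn:aws:states:::aws-sdk:dynamodb:getItem",
    "Parameters": {
        "TableName": "Orders",  # placeholder table name
        "Key": {"order_id": {"S.$": "$.order_id"}},
    },
    "End": True,
}

print(json.dumps({"Optimized": optimized_task, "Sdk": sdk_task}, indent=2))
```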

What are the input/output size limits and how do you work around them?

State transitions cap input/output at 256 KB. For larger payloads, store data in S3 and pass references (the "claim check" pattern) between states.
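
A minimal claim-check helper, assuming an S3 bucket you control, might look like this. The function name to_state_input, the payloads/ key prefix, and the payload_ref field are all illustrative choices, and s3 is assumed to be a boto3 S3 client.

```python
import json
import uuid

MAX_PAYLOAD_BYTES = 256 * 1024  # the 256 KB limit described above


def to_state_input(s3, bucket, payload, limit=MAX_PAYLOAD_BYTES):
    """Return the payload inline if it fits, else store it in S3 and
    return a small reference that downstream states can resolve.

    `s3` is assumed to be a boto3 "s3" client.
    """
    body = json.dumps(payload).encode("utf-8")
    if len(body) <= limit:
        return payload
    key = f"payloads/{uuid.uuid4()}.json"  # hypothetical key layout
    s3.put_object(Bucket=bucket, Key=key, Body=body)
    # Only the pointer travels through the state machine.
    return {"payload_ref": {"bucket": bucket, "key": key}}
```

Downstream tasks would check for payload_ref and fetch the object before processing. Leaving some headroom below the hard limit is a reasonable precaution, since states may add fields to the data as it moves through the workflow.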


Step Functions is the go-to AWS orchestrator for workflows that need more structure than chained Lambdas. With Distributed Map, Express workflows, and SDK integrations, it covers everything from per-request orchestration to long-running, fan-out batch processing.