Step Functions for orchestrating long-running jobs
Coordinate retries, timeouts, and human approval without writing your own state machine.
We had a video transcoding pipeline that ran for up to four hours, glued together by a Lambda that polled a queue, which triggered another Lambda, which wrote status to DynamoDB. When a step failed at hour three, nobody could tell where it died or how to resume it. Rewriting it as a Step Functions state machine turned an opaque chain into a thing I could actually watch and retry.
Long-running orchestration is where Step Functions earns its keep. The challenge is that Lambda caps at 15 minutes, so the orchestration itself has to outlive its workers. Here's how the pieces fit.
Standard vs Express: pick by duration
The first decision is the workflow type, and it can't be changed once the state machine is created:
| Dimension | Standard | Express |
|---|---|---|
| Max duration | 1 year | 5 minutes |
| Pricing | Per state transition | Per request + duration |
| Execution semantics | Exactly-once | At-least-once (async), at-most-once (sync) |
| History | Full, 90 days | CloudWatch Logs only |
For long jobs you want Standard: the 1-year ceiling and the visual execution history are exactly what you're paying for. Express is for high-volume short bursts like stream processing.
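As a rough sketch of the duration rule from the table, the helper below picks a workflow type from a job's worst-case runtime. The function name and the idea of deciding on duration alone are illustrative assumptions; in practice execution semantics and pricing matter too.

```python
# Duration ceilings taken from the comparison table above.
EXPRESS_MAX_SECONDS = 5 * 60
STANDARD_MAX_SECONDS = 365 * 24 * 60 * 60


def pick_workflow_type(worst_case_seconds: int) -> str:
    """Hypothetical helper: choose a workflow type by worst-case duration only."""
    if worst_case_seconds > STANDARD_MAX_SECONDS:
        raise ValueError("job exceeds the 1-year Standard limit; split it up")
    if worst_case_seconds <= EXPRESS_MAX_SECONDS:
        return "EXPRESS"
    return "STANDARD"
```

The returned string matches the values boto3's `create_state_machine` accepts for its `type` parameter, so a four-hour transcode (`pick_workflow_type(4 * 3600)`) lands on `"STANDARD"`.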
Don't block, use the callback pattern
The key trick for work that exceeds Lambda's 15-minute limit is the .waitForTaskToken integration. Step Functions hands a task token to your worker (a Batch job, an ECS task, anything), then pauses, paying nothing and holding no compute, until that worker calls back with SendTaskSuccess. The job can run for hours; the state machine just waits.
```json
{
  "Transcode": {
    "Type": "Task",
    "Resource": "arn:aws:states:::aws-sdk:batch:submitJob.waitForTaskToken",
    "Parameters": {
      "JobName": "transcode-4k",
      "JobQueue": "render-queue",
      "JobDefinition": "transcoder:7",
      "Parameters": { "taskToken.$": "$$.Task.Token" }
    },
    "TimeoutSeconds": 21600,
    "Retry": [{
      "ErrorEquals": ["States.Timeout"],
      "MaxAttempts": 2,
      "BackoffRate": 2.0
    }],
    "Next": "Publish"
  }
}
```
Your AWS Batch container does the heavy lifting and, on completion, calls back:
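```python
import sys
import boto3

# Assumption: the job definition passes Ref::taskToken as the first argument.
task_token = sys.argv[1]

sfn = boto3.client("stepfunctions")
sfn.send_task_success(
    taskToken=task_token,
    output='{"status": "transcoded", "key": "out/video.mp4"}',  # must be a JSON string
)
```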
Polling a queue from Lambda costs you compute every second you wait. The callback pattern costs nothing while idle, and that difference is enormous for jobs measured in hours.
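One gap worth closing in the callback pattern: if the worker crashes without calling back, the state sits idle until its TimeoutSeconds expires. A minimal sketch of a wrapper that reports both outcomes follows. `run_with_callback` and the `work` callable are hypothetical names; `sfn` is assumed to be a boto3 Step Functions client, whose `send_task_success` and `send_task_failure` methods are the only calls used.

```python
import json
import traceback


def run_with_callback(sfn, task_token, work):
    """Run work() and report the outcome to Step Functions.

    sfn: a boto3 Step Functions client (or any object with the same methods).
    work: hypothetical zero-argument callable returning a JSON-serializable dict.
    """
    try:
        result = work()
    except Exception as exc:
        # Report the failure immediately so the state's Retry/Catch can react
        # instead of waiting for the task timeout.
        sfn.send_task_failure(
            taskToken=task_token,
            error=type(exc).__name__,
            cause=traceback.format_exc()[-1000:],  # keep only the tail of the trace
        )
        raise
    # The output must be a JSON string; it becomes the state's output.
    sfn.send_task_success(taskToken=task_token, output=json.dumps(result))
    return result
```

Inside the container this would be called as `run_with_callback(boto3.client("stepfunctions"), task_token, transcode)`, where `transcode` stands in for the actual job.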
Build retries and catches into the graph
The thing I'd hand-rolled badly was failure handling. In Step Functions it's declarative. Each task gets a Retry block with exponential backoff and a Catch that routes failures to a cleanup or notification state. A transient throttle retries automatically; a genuine failure lands in a dead-letter state where I can inspect the exact input that broke it.
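To make the declarative shape concrete, here is a sketch that builds a Task state with a Retry and a Catch as a Python dict, ready to serialize into the state machine definition. The helper names, the retry numbers, and the choice of `States.TaskFailed` as the retried error are illustrative assumptions, not the pipeline's actual settings. `retry_delays` shows the wait before each retry under the Amazon States Language defaults (interval 1 second, 3 attempts, backoff rate 2.0).

```python
import json


def task_with_error_handling(resource, next_state, failure_state):
    """Build a Task state that retries with backoff, then routes to failure_state."""
    return {
        "Type": "Task",
        "Resource": resource,
        "Retry": [{
            "ErrorEquals": ["States.TaskFailed"],
            "IntervalSeconds": 5,   # wait before the first retry
            "MaxAttempts": 3,       # retries, not counting the first attempt
            "BackoffRate": 2.0,     # multiply the wait after each retry
        }],
        "Catch": [{
            "ErrorEquals": ["States.ALL"],
            "ResultPath": "$.error",  # keep the original input, add error details
            "Next": failure_state,
        }],
        "Next": next_state,
    }


def retry_delays(retrier):
    """Seconds waited before each retry for one Retry entry."""
    interval = retrier.get("IntervalSeconds", 1)
    rate = retrier.get("BackoffRate", 2.0)
    attempts = retrier.get("MaxAttempts", 3)
    return [interval * rate ** n for n in range(attempts)]


state = task_with_error_handling(
    "arn:aws:states:::lambda:invoke", "Publish", "NotifyFailure"
)
definition_fragment = json.dumps({"Transcode": state}, indent=2)
```

Setting `ResultPath` on the Catch is what preserves the failing input: the error is attached under `$.error` rather than replacing the state's input.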
Mind the limits that bite long jobs
- Payload size: state input/output is capped at 256 KB. For large data, pass an S3 pointer between states, not the data itself.
- History size: Standard executions cap at 25,000 events. A long loop can blow through this; use a distributed map or child executions for fan-out.
- Idempotency: design callbacks to tolerate replays, since a worker that retries after a failed network call could send the same success twice.
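The S3-pointer advice from the payload bullet can be sketched as a small helper that returns data inline when it is small and writes it to S3 otherwise. The function name, the key prefix, the result shape, and the 200 KB headroom threshold are all assumptions for illustration; `s3` is assumed to be a boto3 S3 client, and only its `put_object` call is used.

```python
import json
import uuid

# Stay comfortably under the 256 KB state payload cap (headroom is an assumption).
MAX_INLINE_BYTES = 200 * 1024


def to_state_output(s3, bucket, payload, limit=MAX_INLINE_BYTES):
    """Return the payload inline if small, else store it in S3 and return a pointer."""
    body = json.dumps(payload).encode("utf-8")
    if len(body) <= limit:
        return {"inline": payload}
    key = f"sfn-payloads/{uuid.uuid4()}.json"  # hypothetical key layout
    s3.put_object(Bucket=bucket, Key=key, Body=body)
    return {"s3": {"bucket": bucket, "key": key}}
```

Downstream states then check which of the two keys is present and fetch from S3 only when they need the full data.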
Operability is the real win
Beyond the mechanics, the payoff is that every execution is a visible graph in the console. When a job fails at hour three, I open the execution, see the red state, read its exact input and the error, and either redrive it or fix the input. With redriveExecution I can resume a failed Standard execution from the failed state rather than rerunning three hours of successful work.
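From a script, the same redrive might look like the sketch below. It assumes `sfn` is a boto3 Step Functions client and relies on the `redriveStatus` field in the `describe_execution` response; the helper name is hypothetical.

```python
def redrive_if_possible(sfn, execution_arn):
    """Redrive a failed Standard execution if Step Functions says it can be.

    Returns True when a redrive was requested, False otherwise.
    """
    desc = sfn.describe_execution(executionArn=execution_arn)
    if desc.get("redriveStatus") != "REDRIVABLE":
        # Still running, succeeded, or otherwise not eligible for redrive.
        return False
    sfn.redrive_execution(executionArn=execution_arn)
    return True
```

Checking the status first avoids a failed API call for executions that are not eligible.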
Takeaways
- Choose Standard workflows for long jobs: the 1-year limit and full execution history are the point.
- Use .waitForTaskToken so the orchestrator pays nothing while a multi-hour worker runs.
- Pass S3 pointers between states to stay under the 256 KB payload limit.
- Lean on declarative Retry/Catch and redriveExecution instead of hand-rolled failure handling.