Step Functions for orchestrating long-running jobs, The Cloud Ledger

We had a video transcoding pipeline that ran for up to four hours, glued together by a Lambda that polled a queue, which triggered another Lambda, which wrote status to DynamoDB. When a step failed at hour three, nobody could tell where it died or how to resume it. Rewriting it as a Step Functions state machine turned an opaque chain into a thing I could actually watch and retry.

Long-running orchestration is where Step Functions earns its keep. The challenge is that Lambda caps at 15 minutes, so the orchestration itself has to outlive its workers. Here's how the pieces fit.

Standard vs Express: pick by duration

The first decision is the workflow type, and it can't be changed once the state machine is created:

Dimension	Standard	Express
Max duration	1 year	5 minutes
Pricing	Per state transition	Per request + duration
Execution semantics	Exactly-once	At-least-once (async), at-most-once (sync)
History	Full, 90 days	CloudWatch Logs only

For long jobs you want Standard: the 1-year ceiling and the visual execution history are exactly what you're paying for. Express is for high-volume short bursts like stream processing.
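
As a rough sketch of that decision rule, the helper below picks a workflow type from a job's worst-case duration. The function name and constants are illustrative assumptions that mirror the table; the returned strings match the type values that CreateStateMachine accepts.

```python
# Duration ceilings taken from the comparison table.
EXPRESS_MAX_SECONDS = 5 * 60
STANDARD_MAX_SECONDS = 365 * 24 * 60 * 60


def pick_workflow_type(worst_case_seconds: int) -> str:
    """Return "EXPRESS" or "STANDARD" for a job's worst-case duration.

    Hypothetical helper: it only encodes the duration rule, not pricing
    or execution semantics, which may also matter for your workload.
    """
    if worst_case_seconds <= 0:
        raise ValueError("duration must be positive")
    if worst_case_seconds > STANDARD_MAX_SECONDS:
        raise ValueError("job exceeds the 1-year Standard ceiling")
    if worst_case_seconds <= EXPRESS_MAX_SECONDS:
        return "EXPRESS"
    return "STANDARD"


print(pick_workflow_type(4 * 60 * 60))  # four-hour transcode -> STANDARD
```

Anything that might run past five minutes lands on Standard, which is why the transcoding pipeline never had a real choice.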

Don't block: use the callback pattern

The key trick for work that exceeds Lambda's 15-minute limit is the .waitForTaskToken integration pattern. Step Functions hands a task token to your worker (a Batch job, an ECS task, anything) and then pauses. The paused state holds no compute and, in a Standard workflow, adds no charge while it waits, until the worker calls back with SendTaskSuccess (or SendTaskFailure if the work broke). The job can run for hours; the state machine just waits.

{
  "Transcode": {
    "Type": "Task",
    "Resource": "arn:aws:states:::aws-sdk:batch:submitJob.waitForTaskToken",
    "Parameters": {
      "JobName": "transcode-4k",
      "JobQueue": "render-queue",
      "JobDefinition": "transcoder:7",
      "Parameters": { "taskToken.$": "$$.Task.Token" }
    },
    "TimeoutSeconds": 21600,
    "Retry": [{
      "ErrorEquals": ["States.Timeout"],
      "MaxAttempts": 2,
      "BackoffRate": 2.0
    }],
    "Next": "Publish"
  }
}

Your AWS Batch container does the heavy lifting and, on completion, calls back:

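import json
import sys
import boto3

# Assumption: the job definition's command forwards the token as the first argument (e.g. Ref::taskToken).
task_token = sys.argv[1]
sfn = boto3.client("stepfunctions")
sfn.send_task_success(
    taskToken=task_token,
    output=json.dumps({"status": "transcoded", "key": "out/video.mp4"}),
)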

Polling a queue from Lambda costs you compute every second you wait. The callback pattern costs nothing while idle, and that difference is enormous for jobs measured in hours.
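
A worker that only reports success leaves the state machine waiting until TimeoutSeconds when the job crashes. Here is a minimal sketch of a wrapper that reports both outcomes. It assumes `sfn` is a boto3 Step Functions client passed in by the caller; `run_and_report` and `work` are hypothetical names, not part of any AWS API.

```python
import json


def run_and_report(sfn, task_token, work):
    """Run `work()` and report the outcome to Step Functions.

    `sfn` is expected to be a boto3 Step Functions client, e.g.
    boto3.client("stepfunctions"); it is passed in so the function can
    be exercised with a stub. `work` is a hypothetical callable that
    returns a JSON-serializable result.
    """
    try:
        result = work()
    except Exception as exc:
        # Fail the waiting state right away instead of letting it sit
        # until TimeoutSeconds expires.
        sfn.send_task_failure(
            taskToken=task_token,
            error=type(exc).__name__,
            cause=str(exc),
        )
        return False
    sfn.send_task_success(taskToken=task_token, output=json.dumps(result))
    return True
```

The error name passed to send_task_failure is what a Retry or Catch block on the waiting state can match against.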

Build retries and catches into the graph

The thing I'd hand-rolled badly was failure handling. In Step Functions it's declarative. Each task gets a Retry block with exponential backoff and a Catch that routes failures to a cleanup or notification state. A transient throttle retries automatically; a genuine failure lands in a dead-letter state where I can inspect the exact input that broke it.
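
To make that concrete, here is a sketch that attaches a Retry and a Catch to a task state built as a Python dict, plus the arithmetic behind exponential backoff. The helper names, the interval values, and the NotifyFailure state name are assumptions for illustration. A real definition would likely list specific throttling errors rather than the broad States.TaskFailed, and the schedule ignores optional settings such as MaxDelaySeconds and jitter.

```python
import json


def retry_delays(interval_seconds, backoff_rate, max_attempts):
    """Seconds waited before each retry: interval * rate ** (attempt - 1)."""
    return [interval_seconds * backoff_rate ** n for n in range(max_attempts)]


def with_failure_handling(task_state, failure_state):
    """Return a copy of an ASL task state with Retry and Catch attached."""
    state = dict(task_state)
    state["Retry"] = [{
        "ErrorEquals": ["States.TaskFailed"],
        "IntervalSeconds": 5,
        "MaxAttempts": 4,
        "BackoffRate": 2.0,
    }]
    state["Catch"] = [{
        # States.ALL must be the only entry in its ErrorEquals list.
        "ErrorEquals": ["States.ALL"],
        # Keep the original input and add the error details beside it.
        "ResultPath": "$.error",
        "Next": failure_state,
    }]
    return state


task = {"Type": "Task", "Resource": "arn:aws:states:::lambda:invoke", "Next": "Publish"}
print(json.dumps(with_failure_handling(task, "NotifyFailure"), indent=2))
print(retry_delays(5, 2.0, 4))  # [5.0, 10.0, 20.0, 40.0]
```

Setting ResultPath on the Catch is what preserves the failing input, so the failure state receives both the original payload and the error.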

Mind the limits that bite long jobs

Payload size: state input/output is capped at 256 KB. For large data, pass an S3 pointer between states, not the data itself.
History size: a Standard execution's history caps at 25,000 events. A long loop can blow through this, so use a Distributed Map or child executions for fan-out.
Idempotency: design callbacks to tolerate replays, since a worker that retries after a dropped response can send the same success twice.
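
The payload limit is the easiest of these to handle in code. This is a minimal sketch of the S3-pointer approach: return small results inline and spill large ones to S3. The function name, the output shape, and the 200 KB threshold are assumptions chosen to leave headroom under the 256 KB cap; `s3` is expected to be a boto3 S3 client supplied by the caller.

```python
import json

# Stay below the 256 KB cap so the rest of the state data still fits.
INLINE_LIMIT_BYTES = 200 * 1024


def to_state_output(s3, bucket, key, result):
    """Return `result` inline if it is small, else store it in S3 and return a pointer.

    `s3` is expected to be a boto3 S3 client, e.g. boto3.client("s3").
    The {"inline": ...} / {"s3": ...} shape is a hypothetical convention
    that downstream states would need to understand.
    """
    body = json.dumps(result).encode("utf-8")
    if len(body) <= INLINE_LIMIT_BYTES:
        return {"inline": result}
    s3.put_object(Bucket=bucket, Key=key, Body=body)
    return {"s3": {"bucket": bucket, "key": key}}
```

Downstream states then branch on which key is present and fetch the object only when they need the full data.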

Operability is the real win

Beyond the mechanics, the payoff is that every execution is a visible graph in the console. When a job fails at hour three, I open the execution, see the red state, read its exact input and the error, and either redrive it or fix the input. With redriveExecution I can resume a failed Standard execution from the failed state rather than rerunning three hours of successful work.
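
The console offers redrive as a button, but it can also be scripted. The sketch below checks whether an execution is redrivable before calling RedriveExecution. It assumes `sfn` is a boto3 Step Functions client recent enough to include redrive support; `redrive_if_possible` is a hypothetical helper name.

```python
def redrive_if_possible(sfn, execution_arn):
    """Redrive an execution if Step Functions reports it as redrivable.

    `sfn` is expected to be a boto3 Step Functions client, e.g.
    boto3.client("stepfunctions"). Returns True if a redrive was requested.
    """
    info = sfn.describe_execution(executionArn=execution_arn)
    # describe_execution reports whether the execution can be redriven.
    if info.get("redriveStatus") != "REDRIVABLE":
        return False
    sfn.redrive_execution(executionArn=execution_arn)
    return True
```

Checking the status first avoids an API error on executions that succeeded or are no longer eligible.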

Takeaways

Choose Standard workflows for long jobs; the 1-year limit and full execution history are the point.
Use .waitForTaskToken so the orchestrator pays nothing while a multi-hour worker runs.
Pass S3 pointers between states to stay under the 256 KB payload limit.
Lean on declarative Retry/Catch and redriveExecution instead of hand-rolled failure handling.