Workflow Time Travel
What it gives you
After every step of a Workflow.runWithCheckpoints() execution, the framework snapshots the state and writes a checkpoint. You can later:
- Replay a run from any checkpoint (deterministically reapply later steps).
- Fork a run at any checkpoint with a mutated state to explore alternatives.
- List all checkpoints to drive a “rewind” UI.
Workflow model.
Architecture
- The full state at that point
- runId, stepIndex, stepName, createdAt
- An incremental id (the same runId:step-N shape used to look it up)
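As a rough illustration of the record described above, the sketch below models a checkpoint as a TypeScript type. The names WorkflowCheckpoint, checkpointId, and makeCheckpoint are hypothetical, and the createdAt type is an assumption; the framework's actual definitions may differ.

```typescript
// Hypothetical shape of a checkpoint record. Field names follow the list
// above; the concrete types (e.g. Date for createdAt) are assumptions.
interface WorkflowCheckpoint<S> {
  id: string; // runId:step-N
  runId: string;
  stepIndex: number;
  stepName: string;
  createdAt: Date;
  state: S; // full state snapshot at this point
}

// Builds the lookup id in the runId:step-N shape.
function checkpointId(runId: string, stepIndex: number): string {
  return `${runId}:step-${stepIndex}`;
}

function makeCheckpoint<S>(
  runId: string,
  stepIndex: number,
  stepName: string,
  state: S,
): WorkflowCheckpoint<S> {
  return {
    id: checkpointId(runId, stepIndex),
    runId,
    stepIndex,
    stepName,
    createdAt: new Date(),
    // Deep-copy via JSON so later changes to the live state
    // cannot alter the stored snapshot.
    state: JSON.parse(JSON.stringify(state)) as S,
  };
}
```

The JSON round-trip assumes the state is JSON-serializable, which is also what storing it through a key-value driver would typically require.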
WorkflowCheckpointStore interface
StorageBackedCheckpointStore
Built-in implementation that uses any StorageDriver:
keepLastN is best-effort: after each save(), if more than keepLastN checkpoints exist for the run, the oldest are deleted. Set to Infinity (or omit) to retain everything.
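The pruning rule can be sketched with a toy in-memory store. This is not StorageBackedCheckpointStore itself: InMemoryCheckpointStore, StoredCheckpoint, and the list() method are hypothetical, and "oldest" is assumed here to mean lowest stepIndex.

```typescript
// Toy store illustrating keepLastN; all names here are hypothetical.
interface StoredCheckpoint {
  runId: string;
  stepIndex: number;
  state: unknown;
}

class InMemoryCheckpointStore {
  private readonly byRun = new Map<string, StoredCheckpoint[]>();
  private readonly keepLastN: number;

  // Omitting keepLastN retains everything (defaults to Infinity).
  constructor(keepLastN: number = Infinity) {
    this.keepLastN = keepLastN;
  }

  save(cp: StoredCheckpoint): void {
    const existing = this.byRun.get(cp.runId) ?? [];
    // Overwrite a checkpoint with the same stepIndex, otherwise append.
    const kept = existing.filter((c) => c.stepIndex !== cp.stepIndex);
    kept.push(cp);
    kept.sort((a, b) => a.stepIndex - b.stepIndex);
    // Prune after the write: drop the oldest entries beyond keepLastN.
    const excess = kept.length - this.keepLastN;
    this.byRun.set(cp.runId, excess > 0 ? kept.slice(excess) : kept);
  }

  list(runId: string): StoredCheckpoint[] {
    return [...(this.byRun.get(runId) ?? [])];
  }
}
```

Because pruning happens after each write, a run briefly holds keepLastN + 1 checkpoints, which is one way to read "best-effort".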
Define a workflow with checkpointing
runWithCheckpoints()
Execute the workflow and persist a checkpoint after every step:
Pass an explicit runId to overwrite checkpoints under that ID:
listCheckpoints(runId)
Returns the run's checkpoints sorted by stepIndex ascending. The first entry has stepIndex: -1, stepName: "initial" and contains the state before any step ran.
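To make the ordering concrete, here is a toy runner that records an "initial" entry at stepIndex -1 and then one entry per step. Step, Checkpoint, and runAndRecord are hypothetical names for this sketch, not the framework's API, and steps are synchronous for brevity.

```typescript
// Hypothetical types for illustration only.
type Step<S> = { name: string; run: (state: S) => S };
type Checkpoint<S> = { stepIndex: number; stepName: string; state: S };

function clone<S>(value: S): S {
  return JSON.parse(JSON.stringify(value)) as S;
}

// Runs every step in order and returns the checkpoints it would have written.
function runAndRecord<S>(steps: Step<S>[], initial: S): Checkpoint<S>[] {
  const checkpoints: Checkpoint<S>[] = [];
  const record = (stepIndex: number, stepName: string, state: S): void => {
    checkpoints.push({ stepIndex, stepName, state: clone(state) });
  };

  // The state before any step ran is stored at stepIndex -1.
  record(-1, "initial", initial);

  let state = clone(initial);
  steps.forEach((step, index) => {
    state = step.run(state);
    record(index, step.name, state); // snapshot after each step
  });
  return checkpoints;
}
```

With two steps, the resulting list has three entries with stepIndex -1, 0, and 1.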
replay(checkpointId)
Re-execute the workflow from a specific checkpoint:
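Conceptually, replaying means reapplying every step after the checkpoint's stepIndex to the stored state. A minimal sketch under that assumption, using hypothetical ReplayStep and ReplayCheckpoint types rather than the framework's real API:

```typescript
// Hypothetical types; steps are synchronous here for brevity.
type ReplayStep<S> = { name: string; run: (state: S) => S };
type ReplayCheckpoint<S> = { stepIndex: number; state: S };

function replayFrom<S>(steps: ReplayStep<S>[], checkpoint: ReplayCheckpoint<S>): S {
  // Work on a copy so the stored checkpoint is never mutated.
  let state = JSON.parse(JSON.stringify(checkpoint.state)) as S;
  // stepIndex N holds the state after step N, so resume at N + 1.
  // For the initial checkpoint (stepIndex -1) this runs every step.
  for (const step of steps.slice(checkpoint.stepIndex + 1)) {
    state = step.run(state);
  }
  return state;
}
```

When every step is a pure function of its input state, replaying from any checkpoint yields the same final state as the original run.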
If your steps' run functions are non-deterministic (calling LLMs, hitting external APIs), replays will produce different outputs. You can:
- Mock side effects: override side-effecting calls during replay.
- Cache step outputs: wrap each step’s run in RetryEnvelope + a cache keyed by (stepIndex, JSON.stringify(state)).
- Treat replay as “from here, with the live world”: acceptable for debugging.
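The caching option can be sketched as a small wrapper around a step's run function. This covers only the cache part: RetryEnvelope is not modeled, withStepCache and StepFn are hypothetical names, and the step is synchronous for brevity where a real one would likely be async.

```typescript
type StepFn<S> = (state: S) => S;

// Wraps a step so that repeated calls with the same (stepIndex, state)
// reuse the first output instead of re-running the side effect.
function withStepCache<S>(
  stepIndex: number,
  run: StepFn<S>,
  cache: Map<string, string>,
): StepFn<S> {
  return (state: S): S => {
    // JSON.stringify is sensitive to property order, so states must be
    // built consistently for keys to match.
    const key = `${stepIndex}:${JSON.stringify(state)}`;
    const hit = cache.get(key);
    if (hit !== undefined) {
      return JSON.parse(hit) as S;
    }
    const output = run(state);
    cache.set(key, JSON.stringify(output));
    return output;
  };
}
```

Passing the Map in from outside lets the same cache back both the original run and later replays; a persistent store could fill the same role.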
fork(checkpointId, mutator?)
Branch a new run from a checkpoint with optionally mutated state:
runId).
Forks have completely independent checkpoint chains. Listing checkpoints on the parent and forked runIds shows two disjoint histories.
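The independence of parent and fork comes down to copying the checkpointed state and assigning a fresh runId before anything else happens. A sketch under those assumptions; RunSnapshot, forkFrom, and the counter-based runId scheme are hypothetical and only illustrate the idea:

```typescript
type RunSnapshot<S> = { runId: string; stepIndex: number; state: S };

// Hypothetical id source; a real system would more likely use UUIDs.
let forkCounter = 0;

function forkFrom<S>(
  checkpoint: RunSnapshot<S>,
  mutator?: (state: S) => S,
): RunSnapshot<S> {
  // Copy first, so nothing the fork does can reach the parent's state.
  const copy = JSON.parse(JSON.stringify(checkpoint.state)) as S;
  const state = mutator ? mutator(copy) : copy;
  forkCounter += 1;
  return {
    runId: `${checkpoint.runId}-fork-${forkCounter}`, // new, distinct run
    stepIndex: checkpoint.stepIndex,
    state,
  };
}
```

From here the fork would continue with the remaining steps and write checkpoints under its own runId, leaving the parent's history untouched.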
Use cases
Debugging a flaky agent
A workflow that orchestrates four agents fails at step 3. Instead of rerunning the entire pipeline:
A/B exploration
After running an analysis workflow, fork from step 2 with different parameters to see how the final answer changes:
Resumable long workflows
A 10-step workflow gets killed by SIGTERM after step 6. The next pod picks it up:
Audit / time travel UI
Render the checkpoint list as a timeline. Clicking step N shows the state right before step N+1 ran. Clicking “fork from here” creates a sandbox run.
Combining with DrainController
Graceful K8s rollouts are the canonical “save a checkpoint and exit” scenario:
Performance
- Each checkpoint write is one StorageDriver.set call. For SQLite that’s ~1ms; for Postgres ~5ms.
- Checkpoint size = state size. Keep state lean by storing references (IDs) rather than full LLM messages.
- keepLastN: 50 is a reasonable default. For tight memory budgets, drop to 10.
See also
- Workflows overview: base Workflow class
- Resumable SSE: pair with checkpointing for graceful drain
- @agentium/queue: for workflow runs that should survive process restart entirely