Prompt chaining: how to wire multi-step AI workflows (2026)
- Prompt chaining is the practice of breaking a complex task into a sequence of focused prompts where each step emits a precisely defined payload that the next step consumes as input.
- The key concept: a Chain Handoff Spec, the output contract you write before writing any prompt, that specifies format, required fields, and a fallback for malformed responses.
- When to chain: when a task has multiple reasoning modes, when you need to branch or inspect mid-workflow, or when one step needs a different model or temperature than another.
- Read this as part of our complete prompt engineering guide, the RAILS series hub.
What is prompt chaining, and why does it matter in 2026?
A single prompt asking a model to simultaneously research a topic, draft a document, critique that draft, reformat it, and flag any factual risks is not one task. It is five tasks, each with different cognitive demands, competing for the same forward pass. The model does not refuse. It attempts all five at once, and the result is output that is competent at none of them.
Prompt chaining fixes this by decomposing the workflow into a linear (or branching) sequence of discrete steps, each narrowly scoped to one reasoning mode. Step one researches. Step two drafts against the research. Step three critiques the draft. Step four reformats for delivery. The model is not smarter on any individual step. What changes is that each step gets the model's full attention and, critically, that each step's output is a clean, structured payload the next step can consume without guessing.
This is not a novel idea. The 2022 ReAct paper by Yao et al. from Google and Princeton (arXiv:2210.03629) showed that interleaving reasoning traces with actions outperformed both pure reasoning and pure action-taking, precisely because it structured the flow of information across steps. The 2020 RAG paper by Lewis et al. at Facebook AI (arXiv:2005.11401) is, at its core, a two-step chain: retrieve, then generate. Two things are new in 2026. First, the tooling and guidance for executing and inspecting chains is broadly available (see the Anthropic prompt engineering docs and the OpenAI prompt engineering guide). Second, context windows are now large enough that naive single-prompt approaches are feasible, which turns the choice between chaining and staying single-prompt into a real design decision.
Why do most prompt chains break before they ship?
Most chains break not because the prompts are weak but because the handoffs are undefined. The operator writes step one, runs it, likes the output, writes step two, pastes step one's output into it manually, and it works in that one demo session. The following week step one returns a slightly different format, step two cannot parse it, and the output is garbage. Nobody wrote down what step one was supposed to emit. There was no contract.
The fix is to write the handoff before writing the prompts. Decide what step one must produce in order for step two to function. Make that a typed specification. Then write step one's prompt to produce that specification, and write step two's prompt to consume it. The sequence forces clarity that rarely surfaces when you just start writing prompts top-down.
This is the same principle behind typed interfaces in software. You do not agree on how two modules work at runtime. You agree on the interface before either module is written, and each module is judged against whether it satisfies the interface. A Chain Handoff Spec is a prompt-layer interface contract.
What is the Chain Handoff Spec, and how do you write one?
The Chain Handoff Spec is the output contract each prompt step must satisfy before the next step is permitted to consume its output. It answers four questions: (1) what format must the output be in, (2) which fields are required, (3) what each field must contain, and (4) what the consuming step should do if the output is malformed or missing a required field.
Writing one forces you to answer questions that most chain designers skip. What does "research output" actually mean in machine-readable terms? A list of claims? A list of sources? A confidence score per claim? A word count? Without answers to these questions, step two must guess, and guessing accumulates error across the chain.
Here is a full worked handoff spec for a three-step content research chain. The spec is written first, before any prompt is drafted.
// STEP 1 OUTPUT CONTRACT
{
  "step": "research",
  "topic": "string - the exact topic passed in",
  "claims": [
    {
      "claim": "string - one verifiable assertion",
      "source": "string - URL or citation, or '[unverified]'",
      "confidence": "high | medium | low"
    }
  ],
  "claim_count": integer,
  "gaps": ["string - things the research could not verify"]
}

// STEP 2 OUTPUT CONTRACT (consumes step 1, emits for step 3)
{
  "step": "draft",
  "sections": [
    {
      "heading": "string",
      "body": "string - prose, markdown allowed",
      "claims_used": [integer] // index into step 1 claims array
    }
  ],
  "word_count": integer,
  "unverified_claims_used": integer
}

// STEP 3 OUTPUT CONTRACT (final consumer)
{
  "step": "critique",
  "verdict": "approve | revise | block",
  "score": integer, // 0-100
  "issues": [
    {
      "severity": "critical | major | minor",
      "section": "string",
      "description": "string",
      "fix": "string"
    }
  ],
  "fallback": "If score cannot be computed, return verdict: 'block', issues: [{severity: 'critical', description: 'malformed input'}]"
}
Write the spec before writing any prompt. Each prompt's instructions section says "Emit EXACTLY this JSON structure." The fallback clause in step 3 means a parsing error never silently propagates.
Notice a few deliberate choices. Every field that carries content is a string. Arrays are typed. The confidence field is an enum, not a free-form string. The critique step carries an explicit fallback clause that specifies what to emit when the input is broken, so an error never silently propagates as a blank or hallucinated output. The claims_used field in step two holds indices into step one's claims array, which creates a traceable provenance chain from final copy back to source.
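The Step 1 contract above can also be checked mechanically in the orchestration layer. The following is a minimal sketch using only the Python standard library. The function name `validate_research_payload` is hypothetical, and the rule that `claim_count` must equal the length of `claims` is an assumption; the spec only says the field is an integer.

```python
CONFIDENCE_LEVELS = {"high", "medium", "low"}
REQUIRED_FIELDS = {
    "step": str,
    "topic": str,
    "claims": list,
    "claim_count": int,
    "gaps": list,
}


def validate_research_payload(payload):
    """Return a list of contract violations; an empty list means the payload conforms."""
    if not isinstance(payload, dict):
        return ["payload is not a JSON object"]
    errors = []
    for name, expected_type in REQUIRED_FIELDS.items():
        if name not in payload:
            errors.append(f"missing required field: {name}")
        elif not isinstance(payload[name], expected_type):
            errors.append(f"wrong type for field: {name}")
    if errors:
        return errors  # deeper checks are meaningless until the shape is right
    if payload["step"] != "research":
        errors.append("step must be 'research'")
    for i, claim in enumerate(payload["claims"]):
        if not isinstance(claim, dict):
            errors.append(f"claims[{i}] is not an object")
            continue
        for key in ("claim", "source", "confidence"):
            if not isinstance(claim.get(key), str):
                errors.append(f"claims[{i}].{key} must be a string")
        if claim.get("confidence") not in CONFIDENCE_LEVELS:
            errors.append(f"claims[{i}].confidence is not high/medium/low")
    # Assumption: claim_count is meant to equal the number of claims.
    if payload["claim_count"] != len(payload["claims"]):
        errors.append("claim_count does not match len(claims)")
    return errors
```

Returning a list of violations rather than raising on the first one makes the result usable as feedback for a retry, or as the reason attached to a fallback payload.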
What does the anatomy of a well-wired chain look like?
Every prompt step in a well-structured chain carries five layers. These map directly to the RAILS letters taught in the RAILS guide: Role, Architecture, Instructions, Loop, Safety. In a chain, each of the five layers does specific work at the boundary between steps.
| RAILS layer | Role in a chain step | What the step must emit |
|---|---|---|
| R: Role | The persona is scoped to this step's reasoning mode only. A research step gets "senior research analyst." A critique step gets "skeptical editor." Using the same broad role across all steps dilutes the signal. | N/A (governs reasoning, not output format) |
| A: Architecture | The system prompt declares: what this step receives as input, what it must not do (fabricate sources, add commentary outside the schema), and the exact output schema. | The schema declaration itself |
| I: Instructions | Priority-ordered execution rules. The first rule is always: "Consume the input payload according to its schema." The last rule is always: "Emit EXACTLY the output schema. Do not add fields. Do not omit required fields." | The correctly shaped output object |
| L: Loop | The self-critique clause. Before finalizing output, the step scores its own output against the handoff spec. Required fields present? Enum values valid? Word count within range? If any check fails, it revises and re-scores. | A self-audit note appended to output (optional, useful during development) |
| S: Safety | The refuser clause. If the input payload is malformed or missing required fields, the step emits the fallback structure defined in the handoff spec rather than guessing. It does not try to infer what a broken input probably meant. | The fallback payload (when triggered) |
The Loop and Safety layers are the ones practitioners skip. Skipping Loop means a prompt that generates plausible-looking but structurally broken JSON will never catch itself. Skipping Safety means a step that receives garbage from the previous step will hallucinate a response rather than reporting that the input contract was violated. Both failures are silent in production, which is why chains built without these layers look fine in testing and break in ways that are hard to diagnose.
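The Loop layer described above lives inside the prompt, but the same revise-and-re-score idea can be mirrored outside the model. This is a sketch of that orchestration-side variant, not the in-prompt clause itself. `run_with_audit`, `generate`, and `audit` are hypothetical names, and the cap of three attempts is an arbitrary assumption.

```python
def run_with_audit(generate, audit, max_attempts=3):
    """Call generate() until audit() reports no failures or attempts run out.

    generate(feedback) -> candidate output. `feedback` is the list of failed
        checks from the previous attempt (empty on the first call).
    audit(candidate) -> list of failed checks (empty list means pass).
    Returns (candidate, failures, attempts_used).
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    failures = []
    candidate = None
    attempts = 0
    while attempts < max_attempts:
        attempts += 1
        candidate = generate(failures)
        failures = audit(candidate)
        if not failures:
            break
    return candidate, failures, attempts
```

If `failures` is still non-empty when the function returns, the caller can emit the step's fallback payload instead of forwarding the last candidate, which is the Safety layer's job.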
Should you split into a chain, or keep it as one prompt?
The honest answer is that many prompts that operators reflexively chain would perform equally well as a single, well-structured prompt. Chaining adds latency, adds API cost proportional to the number of steps, and adds a new class of failure mode (handoff errors) that does not exist in single-prompt designs. It should earn its keep.
- Does the task require more than one reasoning mode? (Research vs. synthesis vs. critique vs. formatting are different modes. Summarization and formatting are not.) If yes, lean toward splitting. If no, keep single.
- Do you need to inspect, branch, or route mid-workflow? If a human or automated check needs to review or redirect the output of one step before the next step runs, you need a chain. A single prompt cannot expose an intermediate checkpoint. If no inspection is needed, keep single.
- Does one stage need a different model or temperature than another? Research benefits from a larger, higher-temperature model. A formatting or JSON-normalization step works fine on a faster, cheaper model at temperature 0. If yes, split. If the same model and temperature work throughout, a single prompt is cheaper.
Task: summarize three documents and return a JSON verdict
This is a concrete task where the split-or-keep decision is genuinely non-obvious. A single prompt can handle it. A 3-step chain can also handle it. The table below shows what each approach actually costs in latency, tokens, and failure surface. Numbers are illustrative based on published API pricing as of June 2026 and typical latency benchmarks from artificialanalysis.ai. Your real numbers will differ by model, region, and payload size.
| Dimension | Single prompt (one call) | 3-step chain (extract / synthesize / format) | Who wins |
|---|---|---|---|
| Latency (wall clock) | ~1.2 s TTFT + generation time for one call. Illustrated: ~4 s total for a 600-token output on Claude Sonnet 4.6. | 3 sequential round-trips. Illustrated: ~12 s total (each step ~4 s; no parallelism). Parallelizable steps can reduce this but the task here is sequential by design. | Single |
| Token cost | Illustrated: 3 docs x ~800 tokens input + system prompt ~200 = ~2,600 input tokens; ~300 output tokens. At Claude Sonnet 4.6 pricing ($3 / 1M input, $15 / 1M output): ~$0.012 per run. | Step 1 extracts (~2,600 in, ~500 out), Step 2 synthesizes (~700 in, ~400 out), Step 3 formats (~500 in, ~200 out). Total: ~3,800 input, ~1,100 output tokens. Illustrated cost: ~$0.028 per run (~2.3x single-prompt cost). | Single |
| Failure points | One failure surface: the model misreads all three documents together, conflates claims, or drifts on JSON format. Single recovery path: retry the whole call. | Three failure surfaces: Step 1 can truncate or mis-extract; Step 2 can synthesize using a malformed Step 1 payload; Step 3 can mis-format. But each failure is isolated and diagnosable. A bad Step 2 output does not require re-running Step 1. | Depends |
| Debuggability | On failure, the full reasoning is opaque inside one pass. Hard to tell whether the error was in extraction, synthesis, or formatting without re-prompting for a trace. | Each step's output is inspectable. If the final verdict is wrong, you can read Step 1's extraction payload and isolate whether the error entered at extraction, synthesis, or formatting. | Chain |
| Reliability under load | On a long single prompt, the model must hold context for all three documents simultaneously while also tracking JSON schema compliance. Accuracy can degrade as document count scales. | Each step gets one focused job. Step 1 extracts from one document at a time (can loop). Accuracy per step is higher for complex extraction tasks. Reliability scales better with document count. | Chain |
| Model flexibility | One model, one temperature for all tasks: extraction, synthesis, and JSON formatting all share the same settings. | Step 1 can use a larger, higher-temperature model for nuanced extraction. Step 3 (JSON formatting) can use a faster, cheaper model (e.g., GPT-4o mini at $0.15 / 1M input) at temperature 0. Can reduce cost per step selectively. | Chain |
| When to use | 3 short docs, predictable schema, low volume, latency-sensitive. One-off or low-stakes outputs. No need to inspect intermediate reasoning. | 3+ docs, complex extraction with heterogeneous structure, high volume (where per-step caching saves money), human-in-the-loop review at synthesis step, or when extraction errors in production are costly to diagnose. | Context-dependent |
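The token-cost row can be reproduced with a few lines of arithmetic. The sketch below uses only the illustrative token counts and per-million-token prices quoted in the table; `run_cost` is a hypothetical helper, and real prices and payload sizes will differ.

```python
INPUT_PRICE_PER_MTOK = 3.00    # USD per 1M input tokens (figure quoted in the table)
OUTPUT_PRICE_PER_MTOK = 15.00  # USD per 1M output tokens (figure quoted in the table)


def run_cost(calls):
    """Cost in USD for a list of (input_tokens, output_tokens) model calls."""
    total_in = sum(tokens_in for tokens_in, _ in calls)
    total_out = sum(tokens_out for _, tokens_out in calls)
    return (total_in * INPUT_PRICE_PER_MTOK + total_out * OUTPUT_PRICE_PER_MTOK) / 1_000_000


single = run_cost([(2600, 300)])
chain = run_cost([(2600, 500), (700, 400), (500, 200)])
print(f"single: ${single:.4f}  chain: ${chain:.4f}  ratio: {chain / single:.2f}x")
```

With these inputs the single call comes to about $0.0123 and the chain to about $0.0279, a ratio of roughly 2.27x. Most of the chain's extra cost is output tokens, because each intermediate payload is generated once and then read again as input by the next step.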
If fewer than two of the three Chain Split Gate questions above (multiple reasoning modes, mid-workflow inspection, a different model or temperature per stage) return yes, keep the prompt single. The single-prompt design is not naive; it is the right tool for tasks that do not require the complexity of a chain. The system prompt design guide covers how to structure a single prompt that handles multi-part tasks without chaining.
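The threshold rule above (fewer than two yes answers means stay single) is simple enough to write down as a function. This is a sketch that encodes only that rule. The function and parameter names are hypothetical, and it does not capture the nuance that a hard requirement for a mid-workflow checkpoint may justify a chain on its own.

```python
def chain_split_gate(multiple_reasoning_modes, needs_mid_workflow_inspection,
                     needs_different_model_or_temperature):
    """Return 'chain' when at least two of the three questions are answered yes."""
    yes_count = sum([
        bool(multiple_reasoning_modes),
        bool(needs_mid_workflow_inspection),
        bool(needs_different_model_or_temperature),
    ])
    return "chain" if yes_count >= 2 else "single"


# Summarize-and-format on one model with no checkpoint: stay single.
print(chain_split_gate(False, False, False))
# Research, then critique, with a human review in between: split.
print(chain_split_gate(True, True, False))
```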
Worked example: a three-step content research chain
The handoff spec above defined the structure. Below is the step-one prompt body that implements it, with its user message template and an example output; steps two and three are summarized after it. This is a real, runnable prompt, not an illustrative sketch. The inputs are parameterized using template variable slots from the prompt templates guide.
SYSTEM PROMPT (step 1):
You are a senior research analyst. Your only job in this step is to surface verifiable claims about the topic and identify what cannot be verified. You do not draft, opine, or format for readers. You emit structured research output only.
FORBIDDEN: do not fabricate sources. If you cannot find a source, mark the claim "[unverified]". Do not use em-dashes, hedging phrases ("it is worth noting"), or filler.
OUTPUT: emit EXACTLY the Step 1 JSON schema. No prose outside the schema. If topic is missing or blank, emit: {"step":"research","topic":"","claims":[],"claim_count":0,"gaps":["No topic provided"]}
USER MESSAGE (template):
Topic: {{topic}}
Scope: {{scope_notes}}
Depth: {{depth}} claims maximum
Research the topic above. For each claim: state it clearly, provide the best available source URL or "[unverified]", and rate confidence as high/medium/low. List gaps where research was insufficient or contradicted itself.
Example output (topic = "prompt chaining latency cost"):
{
  "step": "research",
  "topic": "prompt chaining latency cost",
  "claims": [
    {
      "claim": "Each additional chain step adds one full model round-trip, so a 3-step chain triples minimum latency vs a single call",
      "source": "https://platform.openai.com/docs/guides/latency-optimization",
      "confidence": "high"
    },
    {
      "claim": "Caching shared context between steps reduces token cost on repeated runs",
      "source": "https://docs.anthropic.com/en/docs/build-with-claude/prompt-caching",
      "confidence": "high"
    }
  ],
  "claim_count": 2,
  "gaps": ["No published benchmark comparing chain vs single-prompt accuracy at equivalent task complexity"]
}
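The `{{topic}}`, `{{scope_notes}}`, and `{{depth}}` slots in the user message template need to be filled before the prompt is sent. One way to do that, sketched with the standard library only, is shown here. The `render` helper is hypothetical, and failing loudly on an unfilled slot is a design assumption rather than something the template format requires.

```python
import re

SLOT = re.compile(r"\{\{\s*(\w+)\s*\}\}")

USER_TEMPLATE = (
    "Topic: {{topic}}\n"
    "Scope: {{scope_notes}}\n"
    "Depth: {{depth}} claims maximum"
)


def render(template, values):
    """Fill every {{slot}} in template from values; raise if any slot has no value."""
    missing = sorted({name for name in SLOT.findall(template) if name not in values})
    if missing:
        raise KeyError(f"unfilled template slots: {', '.join(missing)}")
    return SLOT.sub(lambda match: str(values[match.group(1)]), template)


message = render(USER_TEMPLATE, {
    "topic": "prompt chaining latency cost",
    "scope_notes": "API pricing only",
    "depth": 5,
})
print(message)
```

Raising on a missing slot keeps a literal `{{depth}}` from reaching the model, where it would otherwise be read as part of the instructions.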
Step two receives the step-one JSON as its input and drafts against the verified claims, citing them by index. Step three receives the step-two draft and the step-one claims together, and evaluates whether the draft used claims accurately and flagged the unverified ones. The three prompts together take about four minutes to write once the handoff spec exists. Without the spec, the same work takes far longer because the drafter is constantly deciding what the research output "probably" contains.
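The data flow described above (step two consumes step one; step three consumes the draft together with the step-one claims) can be sketched as a small orchestrator. `call_model` is a hypothetical wrapper around whatever API client you use, and `fake_model` is a stand-in that returns canned JSON so the wiring can be run without a network call.

```python
import json


def run_chain(topic, call_model):
    """Run the three steps in order, parsing each step's JSON before handing it on.

    call_model(step_name, payload_dict) -> raw JSON string (hypothetical wrapper).
    """
    research = json.loads(call_model("research", {"topic": topic}))
    draft = json.loads(call_model("draft", {"research": research}))
    critique = json.loads(
        call_model("critique", {"claims": research["claims"], "draft": draft})
    )
    return {"research": research, "draft": draft, "critique": critique}


def fake_model(step, payload):
    """Stand-in for a real API call; returns a canned response for each step."""
    canned = {
        "research": {"step": "research", "topic": payload.get("topic", ""),
                     "claims": [], "claim_count": 0, "gaps": []},
        "draft": {"step": "draft", "sections": [], "word_count": 0,
                  "unverified_claims_used": 0},
        "critique": {"step": "critique", "verdict": "approve", "score": 90,
                     "issues": []},
    }
    return json.dumps(canned[step])


result = run_chain("prompt chaining latency cost", fake_model)
print(result["critique"]["verdict"])
```

Every intermediate payload is returned alongside the final critique, so each handoff stays inspectable after the run.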
For a deeper look at how this self-critique clause works in isolation, see our spoke on the chain-of-thought technique, which covers the Loop layer in detail.
Where do prompt chains fail in production?
Chain failure is almost always a boundary problem, not a prompt quality problem. The individual prompts are fine. The handoffs are broken. Four failure modes account for the majority of production chain failures.
Step two expects a JSON object with a "claims" array. Step one intermittently returns a Markdown list of claims instead, because it was not told to refuse deviation from schema. Step two silently receives prose and tries to interpret it. Output varies run-to-run with no error.
FIX: Add to step one's system prompt: "If you cannot produce valid JSON matching the schema, emit the fallback structure. Do not approximate." Add a schema-validation check in the orchestration layer before passing to step two.
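The orchestration-layer half of this fix can be sketched as a gate that refuses to forward anything that is not a JSON object with a claims array. The helper names are hypothetical. The fallback shape follows the step-one fallback structure from the prompt, with the reason recorded in `gaps`, which is an assumption about where the reason should go.

```python
import json


def research_fallback(reason):
    """Step-one fallback payload, with the failure reason recorded in gaps."""
    return {"step": "research", "topic": "", "claims": [], "claim_count": 0,
            "gaps": [reason]}


def parse_step_one(raw_response):
    """Return (payload, ok). ok is False when the fallback was substituted."""
    try:
        payload = json.loads(raw_response)
    except (json.JSONDecodeError, TypeError):
        return research_fallback("step one did not return valid JSON"), False
    if not isinstance(payload, dict) or not isinstance(payload.get("claims"), list):
        return research_fallback("step one output is missing a claims array"), False
    return payload, True


payload, ok = parse_step_one("- claim one\n- claim two")  # a Markdown list, not JSON
print(ok, payload["gaps"])
```

The `ok` flag gives the orchestrator an explicit signal to log, retry, or halt instead of letting step two interpret prose.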
Step one fabricates a source URL marked as "high" confidence. The mark travels through steps two and three unchallenged because no step was given a mandate to verify what it received. The final output confidently cites a hallucinated paper.
FIX: The anti-fabrication instruction belongs in step one (never invent sources), and the critique step (step three) must explicitly check: "Are any sources in the draft marked as '[unverified]' in the step-one payload but cited without that label in the draft?"
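Part of the critique check can also be done deterministically, because `claims_used` indexes into the step-one claims array. The sketch below lists which unverified claims a draft relies on; the function name is hypothetical. It only traces provenance for claims that step one already marked "[unverified]" and cannot detect a fabricated URL that step one presented as a real source.

```python
UNVERIFIED = "[unverified]"


def unverified_claims_in_draft(research, draft):
    """Return sorted indices of step-one claims that the draft uses and that are unverified."""
    claims = research["claims"]
    used = {index for section in draft["sections"] for index in section["claims_used"]}
    out_of_range = sorted(i for i in used if not 0 <= i < len(claims))
    if out_of_range:
        raise IndexError(f"draft cites claim indices that do not exist: {out_of_range}")
    return sorted(i for i in used if claims[i]["source"] == UNVERIFIED)


research = {"claims": [
    {"claim": "A", "source": "https://example.com/a", "confidence": "high"},
    {"claim": "B", "source": "[unverified]", "confidence": "low"},
]}
draft = {"sections": [
    {"heading": "Intro", "body": "...", "claims_used": [0, 1]},
    {"heading": "Detail", "body": "...", "claims_used": [1]},
]}
print(unverified_claims_in_draft(research, draft))
```

The returned indices can be passed to the critique step as a concrete list to check for missing labels, rather than asking it to rediscover them.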
Step two receives the full step-one output, including internal chain-of-thought traces the model produced before the JSON. The traces introduce framing that biases step two's draft in a direction the operator did not intend. The output is coherent but slanted.
FIX: Parse step one's JSON in the orchestration layer and pass only the schema fields to step two. Never inject the model's raw response. Thinking text is for the operator, not the downstream step.
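One way to implement this filter is to decode the first JSON object in the raw response and copy across only the fields named in the contract. This is a heuristic sketch. `extract_schema_fields` is a hypothetical helper, and it assumes the payload is the first decodable JSON object in the text, which will be wrong if the thinking text itself contains valid JSON.

```python
import json

STEP_ONE_FIELDS = ("step", "topic", "claims", "claim_count", "gaps")


def extract_schema_fields(raw_response):
    """Decode the first JSON object in raw_response and keep only contract fields."""
    decoder = json.JSONDecoder()
    start = raw_response.find("{")
    while start != -1:
        try:
            obj, _end = decoder.raw_decode(raw_response, start)
            return {key: obj[key] for key in STEP_ONE_FIELDS if key in obj}
        except json.JSONDecodeError:
            # Not a JSON object at this brace; try the next one.
            start = raw_response.find("{", start + 1)
    raise ValueError("no JSON object found in model response")


raw = (
    "Let me think about {scope} before answering.\n"
    '{"step": "research", "topic": "t", "claims": [], "claim_count": 0, '
    '"gaps": [], "notes": "internal reasoning"}'
)
print(extract_schema_fields(raw))
```

Both the leading reasoning text and the extra `notes` field are dropped, so step two sees only what the handoff spec promises.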
A large step-one output is passed into step two's context window but exceeds the available space after the system prompt is loaded. The model receives a truncated payload with no warning. Step two drafts as if the claims that were cut do not exist.
FIX: Count tokens before injecting step-one output. If the payload exceeds (model context window minus system prompt tokens minus expected output tokens), either compress step-one output (summarize or filter to required fields) or split step two into sub-steps.
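The budget rule in this fix is plain subtraction. The sketch below uses a rough estimate of four characters per token, which is an assumption; swap in the tokenizer for your model when exact counts matter. All function names are hypothetical.

```python
def estimate_tokens(text):
    """Rough heuristic (assumption): about 4 characters per token, rounded up."""
    return (len(text) + 3) // 4


def payload_budget(context_window, system_prompt_tokens, expected_output_tokens):
    """Tokens left for the injected payload after reserving room for prompt and output."""
    return context_window - system_prompt_tokens - expected_output_tokens


def fits_in_context(payload_text, context_window, system_prompt_tokens,
                    expected_output_tokens):
    """True when the estimated payload size is within the remaining budget."""
    budget = payload_budget(context_window, system_prompt_tokens, expected_output_tokens)
    return estimate_tokens(payload_text) <= budget


step_one_output = "x" * 8000  # stand-in for a large serialized payload
if not fits_in_context(step_one_output, context_window=2000,
                       system_prompt_tokens=500, expected_output_tokens=400):
    print("payload too large: compress it or split step two")
```

Running the check before the call turns silent truncation into an explicit branch the orchestrator can act on.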
What is the difference between a prompt chain and an agent?
A prompt chain has deterministic control flow. The operator defines all the steps, all the branching logic, and all the handoff contracts at design time. The model has no say in which step runs next or when the workflow terminates. Given the same branch conditions, the chain executes the same sequence every time; only the content flowing through it varies.
An agent is a system where the model itself decides what to do next and uses tools or sub-models to act on that decision. Given a goal, an agent may choose to call a search tool, read a file, call a sub-prompt, loop three more times, or declare the task done. The branching and termination logic is generated by the model at runtime rather than specified by the operator. The AI agent frameworks comparison covers the current tool landscape in detail.
Neither is universally better. Chains are predictable, auditable, and cheap per run. Agents are flexible for tasks where the branching logic cannot be anticipated. The practical heuristic is to start with a chain and only promote to an agent when you find yourself writing conditional chain logic that depends on the model's own judgment about the content (not just its format). A chain that says "if step one score is above 70, proceed to step two; otherwise re-run step one" is still a chain. A workflow that says "decide what tools to use and in what order to achieve this goal" is an agent.
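The "above 70, proceed; otherwise re-run" rule is operator-defined branching, which is why it still counts as a chain. A sketch of that gate follows. `run_step_one` and `score_step_one` are hypothetical callables, and the cap on re-runs is an added assumption so the loop cannot spin forever.

```python
def run_research_with_gate(run_step_one, score_step_one, threshold=70, max_runs=3):
    """Re-run step one until its score exceeds threshold; return (output, runs_used)."""
    for run in range(1, max_runs + 1):
        output = run_step_one()
        if score_step_one(output) > threshold:
            return output, run
    raise RuntimeError(f"step one did not score above {threshold} in {max_runs} runs")


# Stand-ins: the second attempt clears the bar.
scores = iter([55, 82])
output, runs = run_research_with_gate(
    run_step_one=lambda: {"step": "research"},
    score_step_one=lambda _output: next(scores),
)
print(runs)
```

The branch condition, the threshold, and the retry limit are all fixed at design time; the model supplies content but never chooses the next step.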
When should you promote a chain to a blueprint?
The answer has a concrete threshold: when you have run the same chain three or more times on different inputs and you are still manually wiring the handoffs, it is time to promote it to a parameterized, version-pinned reusable unit. You stop treating it as a one-off sequence and start treating it as infrastructure.
Promotion means: assign the chain a name and a version number, lock the handoff specs to specific schema versions, parameterize every input that varies across runs using explicit template variable slots (covered in the templates and variables spoke), and document the preconditions (what the first step requires to be present in the input payload for the chain to succeed).
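The promotion checklist above maps onto a small record type. The sketch below is one possible shape, not a format defined by any particular tool. The class name, field names, and the example values (chain name, version strings) are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChainDefinition:
    name: str               # human-readable chain name
    version: str            # version of the chain as a whole
    schema_versions: dict   # step name -> pinned handoff spec version
    parameters: tuple       # template slots that vary across runs
    preconditions: tuple    # inputs the first step requires to be present

    def check_input(self, payload):
        """Return the names of required inputs that are missing or blank."""
        return [name for name in self.preconditions if not payload.get(name)]


research_chain = ChainDefinition(
    name="content-research",
    version="1.0.0",
    schema_versions={"research": "1", "draft": "1", "critique": "1"},
    parameters=("topic", "scope_notes", "depth"),
    preconditions=("topic",),
)
print(research_chain.check_input({"topic": ""}))
```

Checking preconditions before the first model call means a blank topic is rejected up front rather than discovered through a fallback payload three steps later.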
When a chain gets serious enough that multiple steps each make downstream decisions based on each other's outputs, you want each step wired by data flow, not copy-paste. That multi-step wiring is what BrainBoot calls a Blueprint, a named multi-brain workflow where each brain (prompt system) is wired to the next by data flow rather than manual handoff. (Disclosure: BrainBoot is our own tool, brainboot.dev.) The concept is worth understanding regardless of whether you use that specific tool: the insight is that once a chain is stable and reused regularly, it stops being a prompt exercise and starts being a workflow artifact that deserves the same versioning and parameterization discipline as production code.
Bottom line
Prompt chaining is not a complexity upgrade. It is a clarity discipline. The question it forces you to answer, "what must this step emit for the next step to work?", is one you were always implicitly answering when you wrote multi-step prompts. Chaining just makes the answer explicit, testable, and durable.
Write the Chain Handoff Spec first. Apply the Chain Split Gate (the three split-or-keep questions) before deciding whether a chain is warranted. Wire the RAILS Loop and Safety layers into each step so the chain fails loudly rather than silently. When a chain stabilizes and you are running it repeatedly, version and parameterize it. That is the full discipline, and it fits on an index card.
For the foundational techniques that make each step in your chain reliable, the RAILS prompt engineering guide covers all twelve techniques including role priming, structured-output contracts, forbidden-pattern lists, and the self-critique loop that is the single highest-leverage addition to any prompt. The few-shot and chain-of-thought spokes cover the exemplar and reasoning layers that strengthen individual steps before you wire them together.