Chain-of-thought prompting: when step-by-step actually helps (and when it hurts)
Chain-of-thought prompting is the practice of asking a language model to narrate its reasoning before it commits to an answer. The underlying intuition is that the model's intermediate computation matters: forcing it to lay out each step in sequence reduces the chance that a silent error in step two poisons the final result. The technique was formally described by Jason Wei and colleagues at Google Research in 2022, and the gains on arithmetic and symbolic reasoning tasks were large enough to change, almost overnight, how practitioners wrote prompts. But the gains are not universal. We have run the same tasks with and without CoT across enough model families to have a clear picture of where it earns its token cost and where it quietly makes outputs worse.
- What it is: a prompting pattern that inserts explicit reasoning steps between the question and the answer, typically introduced with "think step by step" or a worked-out few-shot example.
- When it helps: multi-step arithmetic, code debugging, causal reasoning, or any task where a mistake in an early step propagates forward.
- When it hurts: single-hop lookups, pure summarization, latency-sensitive endpoints, and small models that produce authoritative-looking but wrong chains.
- The gate: the CoT Decision Gate below resolves the question in three yes/no checks before you write a single line of prompt.
What is chain-of-thought prompting, exactly?
Chain-of-thought prompting produces a reasoning trace before the answer rather than skipping straight to a conclusion. The simplest version is a single appended phrase. The more reliable version supplies two or three worked examples inside the prompt, each demonstrating what a complete chain looks like, then ends with the real question. Either way, the structural change is the same: the model is not allowed to emit a final number or verdict as its first token; it must first generate the intermediate computation that justifies it.
This matters because transformer-based language models predict tokens sequentially. A model that jumps immediately to "the answer is 47" is generating that token based on the statistical pattern of questions-followed-by-numbers in its training data. A model that first writes "first, convert the hourly rate to a daily rate; second, multiply by the number of working days; third, subtract the fixed deduction" is generating the answer based on actual arithmetic it just performed in its context window. The chain is not decoration. It is the computation.
What did the research actually show?
The original findings were significant but came with a model-size condition that is widely ignored. Wei et al. (2022, "Chain-of-Thought Prompting Elicits Reasoning in Large Language Models") showed accuracy jumps of 10 to 40 percentage points on benchmark arithmetic and commonsense tasks when chain-of-thought prompting was applied. Those gains were measured on models with 100 billion parameters or more. On smaller models, the chains looked superficially structured but produced incorrect intermediate steps, which in practice means you end up with a confident-sounding wrong answer that is harder to catch than a blunt wrong answer with no chain at all.
Subsequent work filled out the picture. Kojima et al. (2022, "Large Language Models are Zero-Shot Reasoners") showed that zero-shot CoT, triggered only by "let's think step by step," captured most of the gain without needing worked examples. Wang et al. (2022, "Self-Consistency Improves Chain of Thought Reasoning in Language Models") showed that sampling multiple independent chains and taking the majority answer pushed accuracy higher still, at the cost of additional inference calls. These results together define the technique's operating range: a large model, a verifiable task, and an acceptable cost.
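The majority-vote step of self-consistency can be sketched in a few lines. This is a minimal illustration, not the procedure from Wang et al. verbatim: it assumes you have already sampled several chains and extracted one final answer string from each, and the `majority_answer` helper and the sample values are hypothetical.

```python
from collections import Counter


def majority_answer(final_answers):
    """Return the most common final answer and its share of the votes."""
    if not final_answers:
        raise ValueError("need at least one sampled answer")
    # Light normalization so "47" and " 47 " count as the same vote.
    votes = Counter(a.strip().lower() for a in final_answers)
    answer, count = votes.most_common(1)[0]
    return answer, count / len(final_answers)


# Hypothetical final answers extracted from five independently sampled chains.
samples = ["47", "47", "52", " 47 ", "45"]
answer, share = majority_answer(samples)
print(answer, share)  # 47 0.6
```

A low vote share is a useful signal in its own right: if no answer wins a clear majority, the chains disagree and the result deserves external verification.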
The CoT Decision Gate: three questions before you write a prompt
Before reaching for CoT, apply a three-question gate that takes roughly thirty seconds and prevents the most common misapplications. The gate lives in the SVG diagram above; the logic is as follows.
1. Is the task multi-step? Does the correct answer require at least two dependent operations, where a mistake in step one would corrupt step two? If no: skip CoT. A model generating a chain around a single-hop retrieval is padding, not reasoning.
2. Is the answer verifiable? Can you independently check whether the intermediate steps are correct, either by running code, applying a formula, or consulting a primary source? If no: use CoT cautiously and label the output as unverified. An unverifiable chain can create the illusion of rigor without the substance.
3. Are the latency and cost acceptable? A CoT chain adds 50 to 200 tokens of intermediate output per call, plus the time it takes to generate them. For offline batch analysis that overhead is irrelevant; for a sub-second user-facing endpoint it may be disqualifying. If no: use a direct-answer prompt instead.
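The three checks above can be written down as a small function so the decision is recorded rather than improvised. This is a sketch under stated assumptions: the function name and return strings are hypothetical, and treating a failed latency or cost check as "skip" is our reading of the third question, not something the gate spells out.

```python
def cot_decision_gate(multi_step: bool, verifiable: bool, budget_ok: bool) -> str:
    """Map the three yes/no gate answers to a prompting decision."""
    if not multi_step:
        # Question 1: a chain around a single-hop task is padding.
        return "skip CoT"
    if not budget_ok:
        # Question 3 (assumed outcome): the chain does not fit the latency or cost budget.
        return "skip CoT (budget)"
    if not verifiable:
        # Question 2: the chain cannot be checked, so flag the output.
        return "use CoT, label output unverified"
    return "use CoT"


print(cot_decision_gate(multi_step=True, verifiable=True, budget_ok=True))   # use CoT
print(cot_decision_gate(multi_step=False, verifiable=True, budget_ok=True))  # skip CoT
```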
When CoT helps vs hurts: task-type reference table
The Decision Gate above gives you a three-question filter. This table operationalizes it across eleven real task types, each with a one-line reason so you can apply the verdict without re-deriving it. Verdict labels are based on published benchmark results (Wei et al. 2022; Kojima et al. 2022) plus documented model behavior in current provider guides. Rows marked illustrative are constructed examples that generalize from those results; they are not from a specific benchmark.
| Task type | Verdict | Why (one line) | Evidence basis |
|---|---|---|---|
| Multi-step arithmetic (e.g., compound interest, unit conversions) | Helps | Each step depends on the prior one; a silent carry error compounds without a visible chain to catch it. | Wei et al. 2022: 10-40 pp accuracy gain on GSM8K and similar benchmarks in large models. |
| Multi-hop logical reasoning (e.g., syllogisms, causal inference chains) | Helps | The chain externalizes the inference graph, letting the model track which premises have been used. | Wei et al. 2022: StrategyQA and CommonsenseQA gains; Kojima et al. 2022 zero-shot confirmation. |
| Code debugging (trace a bug across multiple call sites) | Helps | Stepping through execution state explicitly reduces the chance the model skips to a wrong fix. | Anthropic Claude docs and OpenAI prompt engineering guide both recommend CoT for debugging tasks. (illustrative) |
| Legal/regulatory compliance check | Helps | Each jurisdiction rule is an independent check; sequencing them produces an auditable gap map. | Consistent with multi-hop reasoning gains; worked example in this article. (illustrative) |
| Medical or clinical differential | Helps (with verification) | Listing and eliminating differentials step by step reduces early anchoring on the first plausible diagnosis. | General CoT pattern applies; always verify chain against clinical sources independently. (illustrative) |
| Simple factual lookup (e.g., capital cities, unit definitions) | Hurts or wastes tokens | The answer is a single retrieval; the chain pads around it and occasionally introduces an error. | Chain inflation failure mode documented in Wei et al. 2022 and reproduced consistently in smaller models. |
| Single-label text classification (e.g., spam vs not spam, sentiment) | Hurts or wastes tokens | Classification is a single mapping from input to label; reasoning steps add latency without changing accuracy on well-trained classifiers. | CoT gains in classification are inconsistent and often negative in benchmarks; direct prompting preferred for production classifiers. (illustrative) |
| Summarization of a single document | Hurts or wastes tokens | Summarization is extractive/compressive, not inferential; a reasoning chain before a summary adds overhead without a quality benefit. | Consistent with "no multi-step dependency" gate criterion; CoT gains are task-specific. (illustrative) |
| Latency-sensitive API endpoint (sub-200 ms SLA) | Hurts | A 50-200 token CoT chain adds wall-clock latency proportional to its length; at sub-200 ms SLAs that is user-hostile. | Gate Q3 (latency constraint). Token generation latency is linear in output length for autoregressive models. |
| Open-ended creative generation (fiction, poetry, ideation) | Usually hurts | Forcing a sequential reasoning structure before a creative output suppresses the associative leaps that produce interesting results. | Qualitative pattern; exception is structured creative tasks (outlines, argument maps) where the chain is the product. (illustrative) |
| Structured argument analysis (e.g., identify premises, evaluate validity) | Helps (with caveats) | The chain is the product here; but without an anti-fabrication clause the model may invent authoritative-looking premises that were not in the source. | Consistent with multi-hop gains; anti-fabrication discipline required. (illustrative) |
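The latency row is easy to sanity-check with arithmetic. The sketch below assumes a decode throughput of 50 tokens per second, which is a placeholder and not a measured figure for any particular model; substitute your own measurement. The function name is hypothetical.

```python
def added_latency_ms(chain_tokens: int, tokens_per_second: float) -> float:
    """Extra wall-clock time spent generating the chain, in milliseconds."""
    return chain_tokens / tokens_per_second * 1000.0


ASSUMED_TPS = 50.0  # assumption: replace with your measured decode throughput
SLA_MS = 200.0      # the sub-200 ms budget from the table

for tokens in (50, 200):
    extra = added_latency_ms(tokens, ASSUMED_TPS)
    print(tokens, extra, extra <= SLA_MS)
# 50 1000.0 False
# 200 4000.0 False
```

Under that assumption even the shortest chain in the 50 to 200 token range overshoots the budget several times over, before counting the tokens of the answer itself.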
Does zero-shot CoT or few-shot CoT perform better?
Few-shot CoT is more reliable; zero-shot CoT is faster to deploy. The practical split: start with zero-shot by appending "think through this step by step before answering" to your prompt. If the model's chain structure is inconsistent, the steps are sparse, or the format of intermediate results keeps varying, promote to few-shot by prepending two to three worked examples that each show the full chain from question through each step to answer. The added tokens per example are the cost. The benefit is that the model inherits the chain style you demonstrate, which produces cleaner output and makes it easier to parse the steps programmatically if you need to.
One non-obvious pattern: include one deliberately plain example among your few-shot set. If all your examples show impressive multi-step chains, the model anchors to high complexity and may add spurious steps to simple sub-tasks. A single clear, minimal chain in the example set anchors quality at both ends. Call this the sanity baseline: the model learns what a correct short chain looks like, not just what a long one looks like.
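Both variants are a few lines of string assembly. The following is a minimal sketch: the helper names, the `Q:` / `Step n:` / `Answer:` layout, and the two worked examples are all illustrative choices, not a required format. The second example is the deliberately plain sanity baseline.

```python
ZERO_SHOT_SUFFIX = "Think through this step by step before answering."


def zero_shot_cot(question: str) -> str:
    """Zero-shot CoT: append the trigger phrase to the question."""
    return f"{question}\n\n{ZERO_SHOT_SUFFIX}"


def few_shot_cot(question: str, examples: list[tuple[str, list[str], str]]) -> str:
    """Few-shot CoT: prepend worked examples, then end with the real question."""
    blocks = []
    for q, steps, answer in examples:
        numbered = "\n".join(f"Step {i}: {s}" for i, s in enumerate(steps, 1))
        blocks.append(f"Q: {q}\n{numbered}\nAnswer: {answer}")
    blocks.append(f"Q: {question}")
    return "\n\n".join(blocks)


EXAMPLES = [
    ("A shirt costs 20 and is discounted 25%, then taxed 10%. Final price?",
     ["25% of 20 is 5, so the discounted price is 15.",
      "10% of 15 is 1.5, so the final price is 16.5."],
     "16.5"),
    # Sanity baseline: one short, correct chain so the model does not pad simple tasks.
    ("What is 6 times 7?", ["6 times 7 is 42."], "42"),
]

print(few_shot_cot("A book costs 30 and is discounted 10%. Final price?", EXAMPLES))
```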
Worked prompt examples: bad vs CoT-built
The clearest way to see the difference is side by side. The task: determine whether a proposed SaaS refund policy is legally safe to publish in the UK and EU.
Bad prompt: Review this refund policy and tell me if it is legally safe for UK and EU customers. [policy text]
The model jumps to a verdict. It may be correct, but you have no view into whether it checked the Consumer Rights Act 2015, the EU Consumer Rights Directive, digital goods rules, or the 14-day cooling-off period. A mistake anywhere is invisible.
CoT-built prompt:
You are a consumer-law compliance analyst specialising in UK and EU digital goods regulation. Review the refund policy below. Before you give a verdict, work through these steps explicitly:
1. Identify which jurisdiction rules apply (UK Consumer Rights Act 2015, EU Consumer Rights Directive 2011/83/EU, digital goods exceptions).
2. Check each clause against the mandatory disclosures required in each jurisdiction.
3. Flag any clause that is missing, ambiguous, or conflicts with mandatory rights.
4. Only then give a verdict: COMPLIANT / NEEDS REVISION / NON-COMPLIANT, with a one-line reason per flag.
Do not invent case citations; name the statutory instrument and clause if you reference law, and label any inference as [inferred].
[policy text]
Now the chain is auditable. If the model misses the 14-day cooling-off period, you can see the gap in step two rather than finding it when a customer files a chargeback.
Notice what the CoT-built version adds beyond the step instruction: a specific named role with jurisdiction expertise (not "you are a legal expert"), an output format that names verdict categories, and an explicit anti-fabrication clause that limits the model to named statutes. That combination of techniques is what makes CoT actually useful in production rather than theatrical. For a complete treatment of the role, format, and anti-fabrication layers, see the RAILS framework guide.
The same structure as a reusable template:
You are a {{named_competence}} with expertise in {{specific_domain}}.
Task: {{task_description}}
Before answering, think step by step:
- Step 1: {{first_dependency}}
- Step 2: {{second_dependency}}
- Step N: draw a conclusion from the above steps only.
Output format: {{verdict_field}} | {{reasoning_field}} | {{confidence_field}}
Anti-fabrication rule: name the evidence type you are drawing on (document, formula, statute). Mark any inference [inferred]. If you cannot complete a step with available information, state that and stop.
{{input}}
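If you keep the template in code, it is worth failing loudly when a placeholder is left unfilled, since a stray `{{...}}` in a live prompt is easy to miss. A minimal sketch using only the standard library; the `render` helper and the shortened template are illustrative, not part of any framework.

```python
import re

PLACEHOLDER = re.compile(r"\{\{(\w+)\}\}")


def render(template: str, values: dict[str, str]) -> str:
    """Fill {{name}} placeholders; raise if any placeholder has no value."""
    missing = sorted(set(PLACEHOLDER.findall(template)) - set(values))
    if missing:
        raise KeyError(f"unfilled placeholders: {missing}")
    return PLACEHOLDER.sub(lambda m: values[m.group(1)], template)


# Shortened version of the template above, for illustration only.
TEMPLATE = (
    "You are a {{named_competence}} with expertise in {{specific_domain}}.\n"
    "Task: {{task_description}}\n"
    "{{input}}"
)

prompt = render(TEMPLATE, {
    "named_competence": "consumer-law compliance analyst",
    "specific_domain": "UK and EU digital goods regulation",
    "task_description": "Review the refund policy below.",
    "input": "[policy text]",
})
print(prompt)
```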
How does CoT connect to the RAILS Loop?
Chain-of-thought is the reasoning engine; the RAILS Loop is the quality control layer that sits on top of it. The Loop pattern, which we treat as the highest-leverage structural move in reusable prompt design, appends a self-scoring rubric to the end of the prompt and instructs the model to revise and re-score if it falls below a threshold. Applied to a CoT prompt, this means the model first generates its step-by-step chain, scores it against the rubric (step completeness, evidence quality, argument clarity), and revises the chain if needed before emitting the final output. The combination is more reliable than either technique alone because the chain gives you something auditable to score and the loop forces a revision cycle when the chain is weak.
Here is the Loop appended to a CoT prompt:
[Role, task description, step-by-step instruction, output format, anti-fabrication clause as above]
After completing your chain, score your output:
- Step completeness (0-3): did you address every required step?
- Evidence quality (0-3): did every claim cite a source type or flag as [inferred]?
- Argument clarity (0-3): would a non-expert follow the logic?
If your total score is below 7 out of 9, revise the chain and re-score before emitting the final answer. Show the score alongside the output.
The Loop instruction costs roughly 60 additional tokens per call. In batch analysis or high-stakes generation (contracts, medical summaries, code review) those tokens pay for themselves many times over. The RAILS framework is documented in full in our complete prompt engineering guide.
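Because the model reports its own score, it can help to re-check the threshold in code rather than trust the self-assessment. The sketch below assumes the model prints each rubric line as `Name: digit` (for example `Evidence quality: 2`); the Loop instruction does not fix an output format, so that layout and both helper names are assumptions.

```python
import re

RUBRIC = ("Step completeness", "Evidence quality", "Argument clarity")
THRESHOLD = 7  # from the Loop instruction: revise if below 7 out of 9


def parse_scores(output: str) -> dict[str, int]:
    """Extract the 0-3 score reported for each rubric line."""
    scores = {}
    for name in RUBRIC:
        m = re.search(rf"{re.escape(name)}\s*:\s*([0-3])\b", output, re.IGNORECASE)
        if m is None:
            raise ValueError(f"missing score for {name!r}")
        scores[name] = int(m.group(1))
    return scores


def needs_revision(output: str) -> bool:
    """True when the reported total falls below the Loop threshold."""
    return sum(parse_scores(output).values()) < THRESHOLD


# Hypothetical tail of a model response.
reported = "Step completeness: 3\nEvidence quality: 2\nArgument clarity: 1"
print(parse_scores(reported), needs_revision(reported))
```

A missing score raises instead of defaulting to zero, so a response that silently dropped the rubric is caught rather than passed through.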
Where does chain-of-thought prompting fail?
The four failure modes are well-documented and avoidable if you know to look for them.
- Sycophantic chains. On some models, the chain adapts itself to the conclusion it thinks you want rather than reasoning to an objective answer. If your question implies a preferred outcome ("our policy is fine, right? let's think through why"), the chain will often construct post-hoc justification. Mitigate by phrasing the task neutrally and including the refuser clause: "If the evidence supports a negative finding, state that directly."
- Plausible wrong chains on small models. A model with insufficient capacity to perform actual arithmetic or logical reasoning will still generate a chain that looks like reasoning. The intermediate steps are syntactically plausible but mathematically wrong. Always verify intermediate steps on arithmetic tasks regardless of model size.
- Chain inflation on simple tasks. On single-hop lookups, the model pads the chain to match what it learned from training on reasoned texts. The chain adds no accuracy and the padding occasionally introduces an error by making the model second-guess a retrieval it would have gotten right with a direct answer.
- Fabricated citations inside the chain. Without an explicit anti-fabrication rule, the model may invent named studies, statutes, or prior cases as part of its chain. The chain format makes these fabrications look authoritative because they appear mid-reasoning rather than as a final assertion. The anti-fabrication clause in the template above is not optional; it is load-bearing.
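For arithmetic chains, the second failure mode can be caught mechanically by recomputing every `a op b = c` statement the model writes. This is a rough sketch, not a full expression parser: it only handles single binary operations on plain decimal numbers, and the helper name and sample chain are hypothetical.

```python
import operator
import re

OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul, "/": operator.truediv}
NUM = r"-?\d+(?:\.\d+)?"
STEP = re.compile(rf"({NUM})\s*([+\-*/])\s*({NUM})\s*=\s*({NUM})")


def check_chain(chain: str, tol: float = 1e-9) -> list[str]:
    """Return every 'a op b = c' statement in the chain that does not hold."""
    wrong = []
    for m in STEP.finditer(chain):
        a, op, b, claimed = m.groups()
        if op == "/" and float(b) == 0:
            wrong.append(m.group(0))  # division by zero can never be a valid step
            continue
        if abs(OPS[op](float(a), float(b)) - float(claimed)) > tol:
            wrong.append(m.group(0))
    return wrong


# Hypothetical chain with a deliberate error in step 3.
chain = """Step 1: daily rate is 25 * 8 = 200
Step 2: monthly gross is 200 * 20 = 4000
Step 3: after the deduction, 4000 - 350 = 3750"""
print(check_chain(chain))  # ['4000 - 350 = 3750']
```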
Keeping the chain honest: anti-fabrication discipline for CoT prompts
Chain-of-thought is where hallucinations go to look credible. A wrong number dropped directly in an answer is visibly suspicious. The same wrong number embedded in step three of a seven-step reasoning chain inherits the authority of steps one and two. The discipline for CoT prompts is the same as for any prompt, but the stakes of skipping it are higher.
Three rules that pull their weight:
- Tell the model what evidence type to cite, not what evidence to produce. "Name the statute and clause" or "cite the formula you are applying" asks for a pointer, not a fabrication.
- Require the model to label inference as [inferred] and distinguish it from citation.
- Demand that the model stop and report rather than guess when a required step cannot be completed with available information. "If you cannot verify step three, say so and explain what source would be needed" produces a useful gap marker rather than a fabricated bridge.
These constraints are part of the anti-fabrication layer, one of the five rules in the RAILS framework. They apply to all prompts; they are just more urgent in CoT because the chain provides cover for hallucinations that direct-answer prompts do not.
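These rules are easier to enforce if the chain is machine-checkable. One possible convention, which is our assumption and not something the rules above prescribe, is to have the model tag every step with its evidence type in square brackets (`[document]`, `[formula]`, `[statute]`, or `[inferred]`) and then flag any step that carries no tag. The helper name and sample chain are hypothetical.

```python
import re

# Assumed tagging convention: evidence types from the template plus [inferred].
ALLOWED_TAGS = {"document", "formula", "statute", "inferred"}
TAG = re.compile(r"\[(\w+)\]")


def untagged_steps(chain: str) -> list[str]:
    """Return the non-empty lines that carry none of the allowed evidence tags."""
    flagged = []
    for line in chain.splitlines():
        line = line.strip()
        if not line:
            continue
        tags = {t.lower() for t in TAG.findall(line)}
        if not tags & ALLOWED_TAGS:
            flagged.append(line)
    return flagged


# Hypothetical chain: step 2 asserts something without naming its evidence type.
chain = """Step 1: The policy offers refunds within 7 days. [document]
Step 2: Courts have consistently rejected such clauses.
Step 3: A 7-day window is likely too short. [inferred]"""
print(untagged_steps(chain))
```

The check says nothing about whether a tagged claim is true; it only surfaces steps where the model skipped the labelling discipline, which is where fabricated support tends to hide.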
Who should not use chain-of-thought prompting?
Honest anti-recommendation. CoT is not the default mode for all prompts, and adding it indiscriminately is a common mistake made by practitioners who read the 2022 results and applied them universally.
- Anyone running sub-200 ms endpoints. CoT adds latency. On a chatbot that needs to feel responsive, a 150-token reasoning chain before every reply is user-hostile. Use direct-answer prompts with a carefully structured role and output contract instead.
- Prompts that target small or free-tier models exclusively. If your deployment is capped at a model that cannot reliably perform multi-step arithmetic in a chain, CoT provides the appearance of rigor without the substance. Run the CoT Decision Gate question two: is the answer verifiable? If yes, verify it externally rather than trusting the chain.
- Creative and generative tasks. The associative quality that makes creative outputs interesting comes partly from the model making non-linear connections. Forcing a step-by-step chain before a creative output often produces text that is structured and boring. The exception is structured creative tasks like outlining or argument mapping, where the chain is the product.
- Single-hop factual lookups. "What is the capital of France" does not benefit from reasoning steps. Neither does "summarize this paragraph." CoT on retrieval tasks adds tokens and occasionally introduces errors the direct answer would not have made.
Bottom line
Chain-of-thought prompting is a genuine capability amplifier on the tasks it is designed for: multi-step reasoning, verifiable arithmetic, causal analysis. The Wei et al. 2022 result is real and the technique is worth knowing. But it is not a universal upgrade switch. Applied to the wrong task, on the wrong model, or without an anti-fabrication clause, it produces outputs that are harder to correct than a straightforward wrong answer would have been. Apply the CoT Decision Gate before you write the prompt, pair it with the RAILS Loop when accuracy matters, and always tell the model what evidence type to cite rather than letting it invent one. That combination, taught step by step, is how CoT graduates from a blog-post trick to a production-grade technique.
This article is part of the RAILS prompt engineering series at Nesyona. For evaluation patterns, see our sibling spoke on how to evaluate AI models and outputs.
- Wei et al. (2022). "Chain-of-Thought Prompting Elicits Reasoning in Large Language Models." arXiv:2201.11903 (primary source).
- Kojima et al. (2022). "Large Language Models are Zero-Shot Reasoners." arXiv:2205.11916 (primary source).
- Wang et al. (2022). "Self-Consistency Improves Chain of Thought Reasoning in Language Models." arXiv:2203.11171 (primary source).
- OpenAI. "Prompt Engineering Guide." platform.openai.com (verified Jun 2026).
- Anthropic. "Chain-of-Thought Prompting." Claude Documentation (verified Jun 2026).