Published June 2026 · 12 min read · Part of the RAILS prompt engineering series

Chain-of-thought prompting: when step-by-step actually helps (and when it hurts)

Chain-of-thought prompting is the practice of asking a language model to narrate its reasoning before it commits to an answer. The underlying intuition is that the model's intermediate computation matters: forcing it to lay out each step in sequence reduces the chance that a silent error in step two poisons the final result. The technique was formally described by Jason Wei and colleagues at Google Research in 2022, and the gains on arithmetic and symbolic reasoning tasks were large enough to shift how practitioners wrote prompts overnight. But the gains are not universal. We have run the same tasks with and without CoT across enough model families to have a clear picture of where it earns its token cost and where it quietly makes outputs worse.

Last reviewed: June 2026 · Next review: December 2026
Table of contents
  1. What chain-of-thought prompting actually is
  2. What the research actually showed
  3. The CoT Decision Gate
  4. When CoT helps vs hurts: task-type table
  5. Zero-shot vs few-shot CoT
  6. Worked prompt examples
  7. CoT and the RAILS Loop
  8. Where CoT fails
  9. Keeping the chain honest
  10. Who should not use CoT
  11. FAQ
  12. Bottom line
[Figure: The CoT Decision Gate. Q1: Multi-step task? (Does step 2 depend on step 1?) No: skip CoT. Q2: Verifiable answer? (Can you check whether step N is correct?) No: use with caution and verify the chain. Q3: Latency and cost acceptable? Yes: use CoT. All three yes means CoT is justified.]

What is chain-of-thought prompting, exactly?

Chain-of-thought prompting produces a reasoning trace before the answer rather than skipping straight to a conclusion. The simplest version is a single appended phrase. The more reliable version supplies two or three worked examples inside the prompt, each showing what a complete chain looks like, then ends with the real question. Either way, the structural change is the same: the model is not allowed to emit a final number or verdict as its first token; it must first generate the intermediate computation that justifies it.

This matters because transformer-based language models predict tokens sequentially. A model that jumps immediately to "the answer is 47" is generating that token based on the statistical pattern of questions-followed-by-numbers in its training data. A model that first writes "first, convert the hourly rate to a daily rate; second, multiply by the number of working days; third, subtract the fixed deduction" is generating the answer based on actual arithmetic it just performed in its context window. The chain is not decoration. It is the computation.

What did the research actually show?

The original findings were significant but came with a model-size condition that is widely ignored. Wei et al. (2022, "Chain-of-Thought Prompting Elicits Reasoning in Large Language Models") showed accuracy jumps of 10 to 40 percentage points on benchmark arithmetic and commonsense tasks when chain-of-thought prompting was applied. Those gains were measured on models with 100 billion parameters or more. On smaller models, the chains looked superficially structured but produced incorrect intermediate steps, which in practice means you end up with a confident-sounding wrong answer that is harder to catch than a blunt wrong answer with no chain at all.

Subsequent work filled out the picture. Kojima et al. (2022, "Large Language Models are Zero-Shot Reasoners") showed that zero-shot CoT, triggered only by "Let's think step by step," captured much of the gain without needing worked examples. Wang et al. (2022, "Self-Consistency Improves Chain of Thought Reasoning in Language Models") showed that sampling multiple independent chains and taking the majority answer pushed accuracy higher still, at the cost of additional inference calls. These results together define the technique's operating range: a large model, a verifiable task, and an acceptable cost.
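The majority-vote idea behind self-consistency can be sketched in a few lines. This is a minimal illustration, not the procedure from the paper: `sample_chain` is a hypothetical stand-in for whatever model call you use, and the convention that each chain ends with an "Answer: ..." line is an assumption made here so the final answer can be extracted.

```python
from collections import Counter
from typing import Callable, Optional


def extract_answer(chain: str) -> Optional[str]:
    """Return the text after the last 'Answer:' marker, or None if absent.

    The 'Answer:' convention is an assumption of this sketch.
    """
    marker = "Answer:"
    idx = chain.rfind(marker)
    if idx == -1:
        return None
    return chain[idx + len(marker):].strip()


def self_consistent_answer(
    sample_chain: Callable[[str], str],  # hypothetical model call: prompt -> chain text
    prompt: str,
    n: int = 5,
) -> Optional[str]:
    """Sample n chains and return the most common final answer."""
    answers = []
    for _ in range(n):
        answer = extract_answer(sample_chain(prompt))
        if answer is not None:
            answers.append(answer)
    if not answers:
        return None
    # most_common(1) returns a list with one (answer, count) pair.
    return Counter(answers).most_common(1)[0][0]
```

Each extra sample is another inference call, so `n` is the knob that trades accuracy against the cost mentioned above.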

The CoT Decision Gate: three questions before you write a prompt

Before reaching for CoT, apply a three-question gate that takes roughly thirty seconds and prevents the most common misapplications. The gate is summarized in the diagram above; the logic is as follows.

The CoT Decision Gate
  1. Is the task multi-step? Does the correct answer require at least two dependent operations, where a mistake in step one would corrupt step two? If no: skip CoT. A model generating a chain around a single-hop retrieval is padding, not reasoning.
  2. Is the answer verifiable? Can you independently check whether the intermediate steps are correct, either by running code, applying a formula, or consulting a primary source? If no: use CoT cautiously and label the output as unverified. An unverifiable chain can create the illusion of rigor without the substance.
  3. Is the latency and cost acceptable? A CoT chain adds 50 to 200 tokens of intermediate output per call, plus the tail latency of generating them. For batch-offline analysis that is irrelevant; for a sub-second user-facing endpoint it may not be.
Use CoT when
All three: yes. Multi-step task, verifiable answer, and latency is not a hard constraint. Math, code debugging, diagnostic reasoning, structured argument analysis.
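If you want the gate in a pipeline rather than in your head, it reduces to three booleans. The sketch below is one possible encoding: the `Task` fields and the verdict strings are names chosen here, and treating a "no" on the third question as "skip CoT" is an assumption, since the gate only states what happens when all three answers are yes.

```python
from dataclasses import dataclass


@dataclass
class Task:
    multi_step: bool       # Q1: does a later step depend on an earlier one?
    verifiable: bool       # Q2: can intermediate steps be checked independently?
    latency_cost_ok: bool  # Q3: is the extra output affordable for this call?


def cot_decision(task: Task) -> str:
    """Apply the three gate questions in order and return a verdict."""
    if not task.multi_step:
        return "skip CoT"
    if not task.verifiable:
        return "use with caution"
    if not task.latency_cost_ok:
        # Assumption of this sketch: the gate gives no explicit verdict here.
        return "skip CoT"
    return "use CoT"


print(cot_decision(Task(multi_step=True, verifiable=True, latency_cost_ok=True)))
```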

When CoT helps vs hurts: task-type reference table

The Decision Gate above gives you a three-question filter. This table operationalizes it across eleven real task types, each with a one-line reason so you can apply the verdict without re-deriving it. Verdict labels are based on published benchmark results (Wei et al. 2022; Kojima et al. 2022) plus documented model behavior in current provider guides. Rows marked illustrative are constructed examples that generalize from those results; they are not from a specific benchmark.

Original data asset - CoT Decision Gate
When CoT helps vs hurts by task type
Task type | Verdict | Why (one line) | Evidence basis
Multi-step arithmetic (e.g., compound interest, unit conversions) | Helps | Each step depends on the prior one; a silent carry error compounds without a visible chain to catch it. | Wei et al. 2022: 10-40 pp accuracy gain on GSM8K and similar benchmarks in large models.
Multi-hop logical reasoning (e.g., syllogisms, causal inference chains) | Helps | The chain externalizes the inference graph, letting the model track which premises have been used. | Wei et al. 2022: StrategyQA and CommonsenseQA gains; Kojima et al. 2022 zero-shot confirmation.
Code debugging (trace a bug across multiple call sites) | Helps | Stepping through execution state explicitly reduces the chance the model skips to a wrong fix. | Anthropic Claude docs and OpenAI prompt engineering guide both recommend CoT for debugging tasks. (illustrative)
Legal/regulatory compliance check | Helps | Each jurisdiction rule is an independent check; sequencing them produces an auditable gap map. | Consistent with multi-hop reasoning gains; worked example in this article. (illustrative)
Medical or clinical differential | Helps (with verification) | Listing and eliminating differentials step by step reduces early anchoring on the first plausible diagnosis. | General CoT pattern applies; always verify chain against clinical sources independently. (illustrative)
Simple factual lookup (e.g., capital cities, unit definitions) | Hurts or wastes tokens | The answer is a single retrieval; the chain pads around it and occasionally introduces an error. | Chain inflation failure mode documented in Wei et al. 2022 and reproduced consistently in smaller models.
Single-label text classification (e.g., spam vs not spam, sentiment) | Hurts or wastes tokens | Classification is a single mapping from input to label; reasoning steps add latency without changing accuracy on well-trained classifiers. | CoT gains in classification are inconsistent and often negative in benchmarks; direct prompting preferred for production classifiers. (illustrative)
Summarization of a single document | Hurts or wastes tokens | Summarization is extractive/compressive, not inferential; a reasoning chain before a summary adds overhead without a quality benefit. | Consistent with "no multi-step dependency" gate criterion; CoT gains are task-specific. (illustrative)
Latency-sensitive API endpoint (sub-200 ms SLA) | Hurts | A 50-200 token CoT chain adds wall-clock latency proportional to its length; at sub-200 ms SLAs that is user-hostile. | Gate Q3 (latency constraint). Token generation latency is linear in output length for autoregressive models.
Open-ended creative generation (fiction, poetry, ideation) | Usually hurts | Forcing a sequential reasoning structure before a creative output suppresses the associative leaps that produce interesting results. | Qualitative pattern; exception is structured creative tasks (outlines, argument maps) where the chain is the product. (illustrative)
Structured argument analysis (e.g., identify premises, evaluate validity) | Helps (with caveats) | The chain is the product here; but without an anti-fabrication clause the model may invent authoritative-looking premises that were not in the source. | Consistent with multi-hop gains; anti-fabrication discipline required. (illustrative)
Accuracy ranges (10-40 pp) are from Wei et al. 2022 Table 2 on GSM8K, SVAMP, AQuA, StrategyQA, and CommonsenseQA at 540B (PaLM) scale. Rows marked illustrative generalize from those results to task types not directly benchmarked in that paper; treat them as informed heuristics, not empirical claims. All claims verified June 2026 against cited sources.
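The latency row is easy to sanity-check with arithmetic. The sketch below assumes a constant decode rate, which is a simplification, and the 50 tokens-per-second figure is an illustrative assumption rather than a measurement of any particular model or provider.

```python
def added_latency_ms(chain_tokens: int, tokens_per_second: float) -> float:
    """Extra wall-clock time spent generating the chain, assuming a constant decode rate."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return chain_tokens / tokens_per_second * 1000.0


# 50 tokens/s is an assumed decode rate, used for illustration only.
for tokens in (50, 200):
    print(tokens, "tokens ->", added_latency_ms(tokens, 50.0), "ms")
```

At that assumed rate a 50-token chain adds about one second and a 200-token chain about four, so either end of the range is well over a 200 ms budget before the answer itself is generated.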

Does zero-shot CoT or few-shot CoT perform better?

Few-shot CoT is more reliable; zero-shot CoT is faster to deploy. The practical split: start with zero-shot by appending "think through this step by step before answering" to your prompt. If the model's chain structure is inconsistent, the steps are sparse, or the format of intermediate results keeps varying, promote to few-shot by prepending two to three worked examples that each show the full chain from question through each step to answer. The added tokens per example are the cost. The benefit is that the model inherits the chain style you demonstrate, which produces cleaner output and makes it easier to parse the steps programmatically if you need to.

One non-obvious pattern: include one deliberately plain example among your few-shot set. If all your examples show impressive multi-step chains, the model anchors to high complexity and may add spurious steps to simple sub-tasks. A single clear, minimal chain in the example set anchors quality at both ends. Call this the sanity baseline: the model learns what a correct short chain looks like, not just what a long one looks like.
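Assembling a few-shot CoT prompt with a sanity baseline might look like the sketch below. The "Q:", "Step n:", and "Answer:" layout and both worked examples are invented here for illustration; only the trigger phrase comes from the text above.

```python
from typing import List, Tuple

Example = Tuple[str, List[str], str]  # (question, steps, answer)


def format_example(question: str, steps: List[str], answer: str) -> str:
    """Render one worked example as question, numbered steps, then answer."""
    lines = [f"Q: {question}"]
    lines += [f"Step {i}: {step}" for i, step in enumerate(steps, start=1)]
    lines.append(f"Answer: {answer}")
    return "\n".join(lines)


def build_few_shot_prompt(examples: List[Example], question: str) -> str:
    """Prepend the worked examples, then end with the real question."""
    blocks = [format_example(q, s, a) for q, s, a in examples]
    blocks.append(f"Q: {question}\nThink through this step by step before answering.")
    return "\n\n".join(blocks)


examples: List[Example] = [
    (
        "A contractor bills 40 per hour, 8 hours a day, for 5 days, minus a fixed deduction of 100. What is the total?",
        [
            "Daily rate: 40 * 8 = 320.",
            "Five days: 320 * 5 = 1600.",
            "Subtract the deduction: 1600 - 100 = 1500.",
        ],
        "1500",
    ),
    # Deliberately plain example: the sanity baseline.
    ("What is 12 + 5?", ["Add the two numbers: 12 + 5 = 17."], "17"),
]

prompt = build_few_shot_prompt(examples, "What is 3 * 4 + 2?")
print(prompt)
```

Keeping the examples as data rather than a hand-written string makes it cheap to swap the baseline in and out and compare chain lengths.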

Worked prompt examples: bad vs CoT-built

The clearest way to see the difference is side by side. The task: determine whether a proposed SaaS refund policy is legally safe to publish in the UK and EU.

Mock prompt comparison: same task, different structure (illustrative)
Without CoT
Review this refund policy and tell me if it is legally safe for UK and EU customers.

[policy text]

The model jumps to a verdict. It may be correct, but you have no view into whether it checked the Consumer Rights Act 2015, the EU Consumer Rights Directive, digital goods rules, or the 14-day cooling-off period. A mistake anywhere is invisible.

With CoT (zero-shot trigger + structure)
You are a consumer-law compliance analyst specialising in UK and EU digital goods regulation.

Review the refund policy below. Before you give a verdict, work through these steps explicitly:
1. Identify which jurisdiction rules apply (UK Consumer Rights Act 2015, EU Consumer Rights Directive 2011/83/EU, digital goods exceptions).
2. Check each clause against the mandatory disclosures required in each jurisdiction.
3. Flag any clause that is missing, ambiguous, or conflicts with mandatory rights.
4. Only then give a verdict: COMPLIANT / NEEDS REVISION / NON-COMPLIANT, with a one-line reason per flag.

Do not invent case citations; name the statutory instrument and clause if you reference law, and label any inference as [inferred].

[policy text]

Now the chain is auditable. If the model misses the 14-day cooling-off period, you can see the gap in step two rather than finding it when a customer files a chargeback.

Notice what the CoT-built version adds beyond the step instruction: a specific named role with jurisdiction expertise (not "you are a legal expert"), an output format that names verdict categories, and an explicit anti-fabrication clause that limits the model to named statutes. That combination of techniques is what makes CoT actually useful in production rather than theatrical. For a complete treatment of the role, format, and anti-fabrication layers, see the RAILS framework guide.

Runnable zero-shot CoT template
You are a {{named_competence}} with expertise in {{specific_domain}}.

Task: {{task_description}}

Before answering, think step by step:
- Step 1: {{first_dependency}}
- Step 2: {{second_dependency}}
- Step N: draw a conclusion from the above steps only.

Output format: {{verdict_field}} | {{reasoning_field}} | {{confidence_field}}

Anti-fabrication rule: name the evidence type you are drawing on (document, formula, statute). Mark any inference [inferred]. If you cannot complete a step with available information, state that and stop.

{{input}}
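To fill the template programmatically, a small helper that refuses to run with unfilled placeholders is usually enough. This is a sketch: `fill_template` is a name chosen here, it assumes placeholders are written as `{{name}}` with word characters only, and the values in the usage lines are hypothetical.

```python
import re
from typing import Dict

PLACEHOLDER = re.compile(r"\{\{(\w+)\}\}")


def fill_template(template: str, values: Dict[str, str]) -> str:
    """Replace each {{name}} with values[name]; raise if any name is missing."""
    missing = sorted(set(PLACEHOLDER.findall(template)) - set(values))
    if missing:
        raise KeyError(f"missing template values: {missing}")
    # A function replacement avoids backslash-escape handling in the values.
    return PLACEHOLDER.sub(lambda m: values[m.group(1)], template)


line = "You are a {{named_competence}} with expertise in {{specific_domain}}."
print(fill_template(line, {
    "named_competence": "consumer-law compliance analyst",  # hypothetical value
    "specific_domain": "UK and EU digital goods regulation",  # hypothetical value
}))
```

Failing loudly on a missing value matters more than it looks: a prompt sent with a literal `{{first_dependency}}` in it still gets an answer, just not to the question you meant to ask.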

How does CoT connect to the RAILS Loop?

Chain-of-thought is the reasoning engine; the RAILS Loop is the quality control layer that sits on top of it. The Loop pattern, which we treat as the highest-leverage structural move in reusable prompt design, appends a self-scoring rubric to the end of the prompt and instructs the model to revise and re-score if it falls below a threshold. Applied to a CoT prompt, this means the model first generates its step-by-step chain, scores it against the rubric (step completeness, evidence quality, argument clarity), and revises the chain if needed before emitting the final output. The combination is more reliable than either technique alone because the chain gives you something auditable to score and the loop forces a revision cycle when the chain is weak.

Here is the Loop appended to a CoT prompt:

CoT + RAILS Loop combined
[Role, task description, step-by-step instruction, output format, anti-fabrication clause as above]

After completing your chain, score your output:
- Step completeness (0-3): did you address every required step?
- Evidence quality (0-3): did every claim cite a source type or flag as [inferred]?
- Argument clarity (0-3): would a non-expert follow the logic?

If your total score is below 7 out of 9, revise the chain and re-score before emitting the final answer. Show the score alongside the output.

The Loop instruction costs roughly 60 additional tokens per call. In batch analysis or high-stakes generation (contracts, medical summaries, code review) those tokens pay for themselves many times over. The RAILS framework is documented in full in our complete prompt engineering guide.
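Because the score is self-reported, it is worth checking the threshold in code rather than trusting that the model revised when it should have. The sketch below assumes the model prints each rubric line as "Name: n", which is a formatting convention you would need to request in the prompt; the rubric names and the 7-out-of-9 threshold come from the prompt above.

```python
import re
from typing import Dict

RUBRIC = ("Step completeness", "Evidence quality", "Argument clarity")
THRESHOLD = 7  # from the prompt: revise if the total is below 7 out of 9


def parse_scores(output: str) -> Dict[str, int]:
    """Extract one 0-3 score per rubric item; raise if any is missing."""
    scores = {}
    for name in RUBRIC:
        match = re.search(
            rf"{re.escape(name)}\s*:\s*([0-3])\b", output, flags=re.IGNORECASE
        )
        if match is None:
            raise ValueError(f"score not found for: {name}")
        scores[name] = int(match.group(1))
    return scores


def needs_revision(output: str) -> bool:
    """True when the self-reported total falls below the threshold."""
    return sum(parse_scores(output).values()) < THRESHOLD
```

A missing score raises instead of defaulting to zero, so a response that silently skipped the rubric is surfaced rather than quietly passed or failed.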

Where does chain-of-thought prompting fail?

The four failure modes are well-documented and avoidable if you know to look for them.

Keeping the chain honest: anti-fabrication discipline for CoT prompts

Chain-of-thought is where hallucinations go to look credible. A wrong number dropped directly in an answer is visibly suspicious. The same wrong number embedded in step three of a seven-step reasoning chain inherits the authority of steps one and two. The discipline for CoT prompts is the same as for any prompt, but the stakes of skipping it are higher.

Three rules that earn their keep:

  1. Tell the model what evidence type to cite, not what evidence to produce. "Name the statute and clause" or "cite the formula you are applying" asks for a pointer, not a fabrication.
  2. Require the model to label inference as [inferred] and distinguish it from citation.
  3. Demand that the model stop and report rather than guess when a required step cannot be completed with available information. "If you cannot verify step three, say so and explain what source would be needed" produces a useful gap marker rather than a fabricated bridge.

These constraints are part of the broader anti-fabrication layer in the RAILS framework, which is one of the five rules in that system. They apply to all prompts; they are just more urgent in CoT because the chain provides cover for hallucinations that direct-answer prompts do not.
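A crude post-check can catch chains that ignore the labeling rules. The sketch below is a keyword heuristic, not a fact checker: it assumes one step per line and flags any line that neither names one of the evidence types from the template (document, formula, statute) nor carries the [inferred] label. The sample chain is invented for illustration.

```python
from typing import List

# Evidence types named in the article's template; extend for your domain.
EVIDENCE_MARKERS = ("document", "formula", "statute")


def unlabeled_steps(chain: str) -> List[str]:
    """Return non-empty lines that neither name an evidence type nor carry [inferred]."""
    flagged = []
    for line in chain.splitlines():
        text = line.strip()
        if not text:
            continue
        lowered = text.lower()
        if "[inferred]" in lowered:
            continue
        if any(marker in lowered for marker in EVIDENCE_MARKERS):
            continue
        flagged.append(text)
    return flagged


chain = (
    "Step 1: Statute: Consumer Rights Act 2015 applies.\n"
    "Step 2: The clause is probably fine.\n"
    "Step 3: Refund window is likely 14 days [inferred]."
)
print(unlabeled_steps(chain))
```

A flagged line is not necessarily wrong; it is a step a human reviewer should read first.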

Who should not use chain-of-thought prompting?

Honest anti-recommendation. CoT is not the default mode for all prompts, and adding it indiscriminately is a common mistake made by practitioners who read the 2022 results and applied them universally.

Frequently asked questions

What is chain-of-thought prompting?
Chain-of-thought prompting asks a large language model to show its intermediate reasoning steps before producing a final answer. It was described by Wei et al. at Google Research in 2022 and shown to improve accuracy on multi-step arithmetic and symbolic reasoning tasks significantly in large models. The simplest form appends "think step by step" to any prompt.
When does chain-of-thought prompting help?
Chain-of-thought helps when a task has three properties: it requires multiple dependent steps (where a mistake in step two would corrupt step three), the final answer is verifiable, and latency and token cost are acceptable. Multi-step math, code debugging, legal reasoning, and diagnosis-style analysis are strong fits. Single-step lookups, pure summarization, and latency-sensitive production endpoints are poor fits.
What is the difference between zero-shot and few-shot chain-of-thought?
Zero-shot CoT appends a phrase like "let's think step by step" without worked examples. It is simple to deploy and costs no extra prompt tokens beyond the trigger phrase. Few-shot CoT provides two to three worked examples showing the full reasoning chain. Few-shot produces more consistent formatting and handles domain-specific reasoning better, but adds tokens per call. Start with zero-shot; upgrade to few-shot when the model's chain structure is sloppy or inconsistent.
Does chain-of-thought prompting work on smaller models?
CoT gains are concentrated in larger models. Wei et al. 2022 found the benefit emerged reliably in models at or above roughly 100 billion parameters. Smaller models can produce plausible-looking chains that contain incorrect intermediate steps. If you are on a smaller or free-tier model, verify intermediate steps independently rather than trusting the chain at face value.
Can chain-of-thought make outputs worse?
Yes. CoT can hurt on single-hop lookups and creative tasks where the model pads a chain around a straightforward answer, occasionally talking itself into an error. It also adds latency and token cost. The CoT Decision Gate three-question test helps you decide before committing.

Bottom line

Chain-of-thought prompting is a genuine capability amplifier on the tasks it is designed for: multi-step reasoning, verifiable arithmetic, causal analysis. The Wei et al. 2022 result is real and the technique is worth knowing. But it is not a universal upgrade switch. Applied to the wrong task, on the wrong model, or without an anti-fabrication clause, it produces outputs that are harder to correct than a straightforward wrong answer would have been. Apply the CoT Decision Gate before you write the prompt, pair it with the RAILS Loop when accuracy matters, and always tell the model what evidence type to cite rather than letting it invent one. That combination, taught step by step, is how CoT graduates from a blog-post trick to a production-grade technique.

This article is part of the RAILS prompt engineering series at Nesyona. For evaluation patterns, see our sibling spoke on how to evaluate AI models and outputs.

  1. Wei et al. (2022). "Chain-of-Thought Prompting Elicits Reasoning in Large Language Models." arXiv:2201.11903.
  2. Kojima et al. (2022). "Large Language Models are Zero-Shot Reasoners." arXiv:2205.11916.
  3. Wang et al. (2022). "Self-Consistency Improves Chain of Thought Reasoning in Language Models." arXiv:2203.11171.
  4. OpenAI. "Prompt Engineering Guide." platform.openai.com. Verified June 2026.
  5. Anthropic. "Chain-of-Thought Prompting." Claude Documentation. Verified June 2026.
Disclosure: Nesyona is reader-supported. This article contains no sponsored placements. The RAILS template pack referenced in the email capture is our own free resource.