Prompt Engineering Updated June 2026 · 10 min read · Part of the RAILS prompt engineering series

Few-shot prompting: how many examples, and which ones (2026)

A few-shot prompt is a prompt that includes a small set of worked input-output demonstrations before the actual request. The model reads those examples, infers the intended format and register, and applies that pattern to your real query. The technique predates GPT-4 by years: Brown et al. (2020) documented it across more than two dozen NLP datasets in the original GPT-3 paper. What most guides do not tell you is that exemplar selection matters at least as much as exemplar count, and that the single highest-leverage move is not picking the best examples but including one deliberately unremarkable one. This article teaches that move, names it, and walks through the full selection logic for building few-shot sections that hold up across models.

Last reviewed: June 2026 Next review: December 2026

Bottom line up front

Definition: Few-shot prompting places 2 to 8 input-output demonstrations in the prompt so the model learns the desired pattern in-context, without a weight update.
The count rule: 2 to 3 examples suffice for format and tone. 4 to 6 help on multi-step reasoning. Beyond 8, context cost rises faster than quality improves. The full four-tier decision matrix (0-shot / 1-2 / 3-5 / 5+) with ordering notes is in the How many examples? section.
The non-obvious move: The Sanity-Baseline Rule. Include one intentionally plain exemplar among your polished ones to anchor the quality floor and prevent output drift toward stylistic overreach.
Order matters: Recency bias is documented. Put your strongest exemplar last, your sanity baseline in the middle.

Table of contents

What is few-shot prompting?
How many examples?
The Sanity-Baseline Rule
Which examples to choose
Does order matter?
Bad prompt vs RAILS-built prompt
Three anti-patterns to avoid
FAQ
Bottom line

What exactly is few-shot prompting?

Few-shot prompting means placing 2 or more worked demonstrations inside the prompt, before your actual query, so the model infers the desired pattern from those examples rather than from abstract rules alone. The name comes from the machine-learning literature's concept of learning from very few labeled samples, as opposed to supervised fine-tuning which requires thousands of labeled examples and a weight update. In a few-shot prompt, no weights change. The model processes the demonstrations as context and generalizes in-context to your request.

The technique sits on a spectrum. A zero-shot prompt gives no examples: just instructions or a question. A one-shot prompt includes exactly one demonstration. Few-shot starts at two demonstrations and typically runs to about eight before context cost starts winning the trade-off. Brown et al. (2020) established this vocabulary in the original GPT-3 paper, showing that even a handful of examples substantially narrowed the gap between zero-shot and fine-tuned performance on many NLP benchmarks. The mechanism is in-context learning: the model reads the examples, identifies the mapping being demonstrated, and applies it.
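The spectrum above can be made concrete with a minimal sketch. The `Example` class, the `build_prompt` helper, and the "Input:/Output:" layout are illustrative assumptions, not a format prescribed by any model vendor; the only point is that the same assembly code yields a zero-shot, one-shot, or few-shot prompt depending on how many demonstrations you pass in.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Example:
    """One worked demonstration: an input and the output we want for it."""
    input_text: str
    output_text: str


def build_prompt(instruction: str, examples: list[Example], query: str) -> str:
    """Assemble instruction, demonstrations, and the real query into one prompt."""
    parts = [instruction.strip()]
    for i, ex in enumerate(examples, start=1):
        parts.append(f"Example {i}:\nInput: {ex.input_text}\nOutput: {ex.output_text}")
    # The real query ends with an open "Output:" slot for the model to complete.
    parts.append(f"Input: {query}\nOutput:")
    return "\n\n".join(parts)


def shot_label(examples: list[Example]) -> str:
    """Name the prompt type by demonstration count, following the usage above."""
    n = len(examples)
    if n == 0:
        return "zero-shot"
    if n == 1:
        return "one-shot"
    return "few-shot"
```

Passing an empty list gives the zero-shot prompt, so the count can be varied without touching the instruction or the query.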

For structured tasks, few-shot is often the single highest-leverage prompt change available before you reach for a more complex technique like chain-of-thought. Wei et al. (2022) showed that pairing worked examples with step-by-step reasoning traces ("chain-of-thought prompting") dramatically improves multi-step arithmetic and commonsense reasoning on large models, combining the pattern-anchoring of few-shot with explicit reasoning scaffolding.

How many examples does a few-shot prompt actually need?

The research-backed sweet spot is 4 to 6 examples for tasks requiring structured reasoning, and 2 to 3 for tasks requiring only format or tone anchoring. The decision matrix below maps each tier to task fit, the point where diminishing returns set in, and the ordering move that matters at that tier.

Original decision matrix / Nesyona RAILS series

How many examples? A four-tier decision matrix

Each tier maps to a task profile, a diminishing-returns signal, and an ordering note grounded in prompt-order sensitivity research (Lu et al. 2022). Gains listed are illustrative of documented patterns, not precise lab measurements.

Shot-count decision matrix: when each tier is right, when returns drop, and how order affects each tier
Tier	Task fit: use when	Expected quality lift	Diminishing-returns signal	Ordering note (recency bias)
0-shot	Open-ended generation; creative tasks with no fixed output shape; simple factual Q&A where format does not matter; prototyping before you have good exemplars	Baseline. No pattern anchor. Output format varies with phrasing.	N/A - no examples to add. If outputs are inconsistent, move to 1-2 shots.	No ordering decision. Instruction placement is the only lever: put the key constraint in the last line of the prompt, where recency bias works in your favor.
1-2 shots	Format and tone anchoring on a single output shape; binary classification (one example per label class); when context budget is tight; when you have only one truly representative example	Substantial lift on format consistency (Brown et al. 2020 reported gains from zero-shot to one-shot on many NLP benchmarks, with the size varying widely by task). Edge cases still drift.	Returns plateau quickly if both examples are structurally identical. Signal: outputs for edge-case inputs still mismatch the target register.	With 1 example, there is no ordering decision. With 2 examples: put the more complex or representative example second, where the model tends to weight it more heavily. If one example is a sanity baseline, put it first here, not last.
3-5 shots	Multi-step reasoning tasks; extraction with a structured schema; classification with 3-5 label classes (one per class); tone calibration when outputs drift across input lengths; the Sanity-Baseline Rule kicks in here	Highest practical lift per token invested. Wei et al. (2022) showed that chain-of-thought gains depend on exemplars that carry the reasoning trace and emerge mainly on large models. Quality gap vs. 6-8 shots is small; context cost gap is significant.	Returns drop when the 4th-5th example is structurally similar to existing ones. Signal: adding another example does not change outputs on your hardest test inputs.	This is where ordering matters most. Put the sanity-baseline exemplar in position 2 or 3 (never last). Put the hardest, most complex exemplar last. Order the rest simple-to-complex. Lu et al. (2022) found that the best ordering varies by model, so treat this as a default that keeps the recency-weighted "anchor" aligned with your target ceiling, and verify it on your own test inputs.
5+ shots	Classification with many label classes (one per class is the heuristic, even if that pushes you to 8-12 examples); noisy domains where outlier examples must be explicitly represented; retrieval-augmented settings where examples are retrieved dynamically and context cost is managed separately	Marginal incremental lift beyond 5-6 on most tasks. Brown et al. (2020) showed gains flatten after ~4-6 demonstrations on most NLP tasks. Context cost rises linearly; quality does not.	Returns are effectively exhausted. Signal: running 5 vs. 8 examples on your test set produces indistinguishable outputs. At this point, reducing to 4-5 high-quality examples and strengthening the instruction is the better trade.	With a large set, cluster by difficulty: easy examples first, hard examples last. The final 2 examples have the most influence on output register (recency bias compounds across a longer sequence). If the task has multiple label classes, end with the label class you want the model to handle most carefully on ambiguous inputs.

Worked decision rule: Start at 0-shot. If output format is inconsistent, add 1-2 examples. If edge cases still drift, add to 3-5 and apply the Sanity-Baseline Rule (one plain exemplar, middle position). Only go above 5 if you have a multi-class classification task or documented evidence that specific outlier examples change outputs on your hardest inputs. Treat the count as a tunable parameter, not a fixed setting: run your 5 hardest expected inputs against 2-shot, 4-shot, and 6-shot versions of the same prompt and pick the crossing point where quality stops improving.
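The "crossing point" in the decision rule can be picked mechanically once you have scored each prompt variant. This is a sketch under stated assumptions: `scores` holds a quality number you computed yourself (for example, the fraction of your five hardest inputs judged acceptable), and `min_gain` is a hypothetical threshold for what counts as a meaningful improvement.

```python
def pick_shot_count(scores: dict[int, float], min_gain: float = 0.02) -> int:
    """Walk shot counts in ascending order, moving to a larger count only
    when it beats the current pick by at least min_gain.

    scores maps shot count -> mean quality on your hardest test inputs.
    """
    if not scores:
        raise ValueError("scores must contain at least one shot count")
    counts = sorted(scores)
    best = counts[0]
    for k in counts[1:]:
        # A larger prompt has to earn its extra context cost.
        if scores[k] - scores[best] >= min_gain:
            best = k
    return best


# Illustrative numbers only, not measurements.
chosen = pick_shot_count({2: 0.70, 4: 0.82, 6: 0.83})
```

With these illustrative scores the function settles on 4: the step from 2 to 4 clears the threshold, and the step from 4 to 6 does not.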

Shot count	Best for	Context cost	Typical outcome
0 (zero-shot)	Open-ended generation, creative tasks, simple factual Q&A	Minimal	Unpredictable format; relies entirely on instruction clarity
1 (one-shot)	Format anchoring when only one output shape exists	Low	Format improves; edge cases still drift
2-3 (few-shot)	Tone and format anchoring; classification with few labels	Low	Consistent register; adequate for most prose and structured tasks
4-6 (few-shot)	Multi-step reasoning; classification with many labels; extraction with complex schemas	Moderate	Highest quality on structured tasks; recommended ceiling for most use cases
7-8	Edge-case coverage; noisy domains requiring outlier examples	Moderate-high	Marginal quality gain; diminishing returns documented by Brown et al. 2020
10+	Rarely justified outside retrieval-augmented or test-time compute settings	High	Context cost rises faster than quality; risk of model over-anchoring on examples vs instruction

The exception to the general ceiling is a classification task with many distinct output labels: there the right heuristic is one example per label class, even if that pushes you above eight examples. The reasoning is that the model needs to see each valid label at least once to emit it reliably. With fewer than one example per class, it tends to collapse toward the labels it saw most often, which biases the output distribution.
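The one-per-class heuristic is easy to enforce in code. The sketch below assumes a hypothetical pool of `(text, label)` pairs and a known label list; it picks one exemplar per label and fails loudly when a label has no exemplar at all, which is the case the heuristic exists to prevent.

```python
from collections import defaultdict


def one_per_label(pool: list[tuple[str, str]], labels: list[str]) -> list[tuple[str, str]]:
    """Pick the first pooled (text, label) pair for each label, in label order."""
    by_label: dict[str, list[tuple[str, str]]] = defaultdict(list)
    for text, label in pool:
        by_label[label].append((text, label))
    missing = [lab for lab in labels if lab not in by_label]
    if missing:
        # A label with no exemplar is one the model never sees demonstrated.
        raise ValueError(f"no exemplar for labels: {missing}")
    return [by_label[lab][0] for lab in labels]
```

Taking the first match per label is a placeholder choice; in practice you would pick the most representative pair for each class.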

The Sanity-Baseline Rule: the non-obvious move in exemplar selection

Named asset / Nesyona RAILS series

The Sanity-Baseline Rule

Definition: When building a few-shot section, deliberately include one plain, unremarkable exemplar alongside your high-quality demonstrations. Its job is not to inspire the model. Its job is to anchor the quality floor.

When every example in your set is polished, dense, and exceptional, the model learns the ceiling of your desired output style but has no signal for what ordinary, clear, functional prose looks like. The result is systematic upward drift: outputs that are stylistically overwrought, that strain for vocabulary, or that reach for complexity where simplicity was the right answer.

Sanity-Baseline Rule: Among your N exemplars, include exactly one input-output pair that is intentionally workmanlike. Not bad. Not wrong. Just plain. Its placement: middle of the set, not last. Its function: a calibration signal telling the model that a direct, simple answer is a fully acceptable output.

The "baseline" in the name is borrowed from statistics: a reference measurement taken from the simplest plausible case, against which everything else is compared. The sanity-baseline exemplar does for your few-shot section what a control condition does for an experiment. It keeps the output distribution honest.

Here is the practical difference. Suppose you are building a few-shot section for a product-copy prompt. You include three examples, all of them your best-ever output: punchy hooks, vivid verbs, unexpected angles. The model sees only the ceiling. When it generates new copy from your actual input, it reaches for the same ceiling on every pass, including cases where the product is straightforward and the right copy is simply direct and clear. The output reads as try-hard.

Add one example that is honest and workmanlike: a clean sentence, a clear benefit statement, nothing embellished. The model now understands that calibration matters. It matches your high-quality examples when the input warrants it, and applies appropriate plainness when the input does not. The sanity baseline is not a quality compromise. It is a quality signal.

Which examples to choose beyond the sanity baseline

The rest of your exemplar set should maximize coverage of the problem space, not polish your personal favorites. The selection heuristics below are ordered by impact.

Cover the variance, not the center. If your few-shot section contains three examples and they are all structurally similar inputs, the model has learned one corner of the space. Choose examples that span the range of inputs you actually expect. If you are prompting for product reviews, include one simple product and one feature-dense product, not three mid-tier products you happen to know well.

Include the hardest case you can construct. Deliberately write one exemplar around an input that is edge-case or borderline: a short input when most of yours will be long, a topic-switch, an input with an unusual constraint. This teaches the model how to handle the hard case rather than leaving it to generalize from the easy ones, which it will do badly.

Match the chain-of-thought requirement. If your task requires reasoning, each exemplar should show the reasoning steps, not just the final answer. Wei et al. (2022) showed this is the core mechanism behind chain-of-thought prompting: the exemplar demonstrates the reasoning trace, not just the output, and the model learns to reproduce that trace on new inputs. An exemplar that shows only the answer is not teaching reasoning; it is teaching answer format.

Do not include examples from a different output schema. This sounds obvious but is the most common few-shot mistake in production. If your target output format changed (you switched from a bullet-list to a JSON object, or from a 3-sentence summary to a 5-sentence one), old exemplars silently teach the wrong schema. Exemplars must be synchronized to the current output format spec. Stale few-shot sections are the prompt-engineering equivalent of stale documentation: they look authoritative and produce silent errors.
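The chain-of-thought heuristic above comes down to what each exemplar contains. As a minimal sketch, the hypothetical `ReasonedExample` type below refuses to render an exemplar that has no reasoning steps, so an answer-only demonstration cannot slip into a reasoning prompt. The "Step N:" layout is an assumption for illustration, not a required format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReasonedExample:
    question: str
    steps: tuple[str, ...]   # the reasoning trace, one step per entry
    answer: str


def render(ex: ReasonedExample) -> str:
    """Render an exemplar with its reasoning trace ahead of the final answer."""
    if not ex.steps:
        raise ValueError("a reasoning exemplar needs at least one step")
    trace = "\n".join(f"Step {i}: {s}" for i, s in enumerate(ex.steps, start=1))
    return f"Q: {ex.question}\n{trace}\nAnswer: {ex.answer}"


pens = ReasonedExample(
    question="A pack holds 6 pens. How many pens are in 4 packs plus 3 loose pens?",
    steps=("4 packs x 6 pens = 24 pens", "24 pens + 3 loose pens = 27 pens"),
    answer="27",
)
rendered = render(pens)
```

The rendered text places both steps between the question and the answer, which is the part of the exemplar the model is meant to imitate.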

Does the order of examples in a few-shot set matter?

Yes. The order of examples demonstrably affects outputs, and the effect can be large enough to change task accuracy by a substantial margin. Lu et al. (2022) documented this in "Fantastically Ordered Prompts and Where to Find Them," showing that on several classification benchmarks, different orderings of the same four exemplars could produce accuracy swings of over 30 percentage points on smaller models. One commonly cited mechanism is recency bias: models weight the most recent context more heavily, so the last exemplar in your few-shot section disproportionately shapes the output.

The practical ordering rule follows directly from this:

Place your strongest, most representative exemplar last. It has the most influence on the output pattern.
Place the sanity-baseline exemplar in the middle of the set. Last position would give the plain exemplar outsized influence; first position means the model may under-weight it entirely.
If your exemplars vary in complexity, order them from least complex to most complex. This mirrors how a teacher sequences worked examples: simple case first, harder case last.
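The three ordering rules above can be sketched as one function. Two assumptions are baked in and should be read as such: `complexity` is a hypothetical hand-assigned score, and the most complex exemplar is treated as the "strongest" one. If your strongest exemplar is not your most complex, sort on a different key.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Exemplar:
    text: str
    complexity: int          # hypothetical hand-assigned score, higher = harder
    is_baseline: bool = False


def order_exemplars(exemplars: list[Exemplar]) -> list[Exemplar]:
    """Sort simple-to-complex, then slot the sanity baseline into the middle."""
    baselines = [e for e in exemplars if e.is_baseline]
    if len(baselines) > 1:
        raise ValueError("expected at most one sanity-baseline exemplar")
    rest = sorted((e for e in exemplars if not e.is_baseline), key=lambda e: e.complexity)
    if not baselines:
        return rest
    if len(rest) < 2:
        # With only two exemplars in total, the baseline goes first, never last.
        return baselines + rest
    mid = len(rest) // 2
    return rest[:mid] + baselines + rest[mid:]
```

Because the baseline is inserted at `len(rest) // 2`, it can never land in the final slot, so the most complex exemplar always closes the set.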

Larger models (GPT-4 class and above) are less sensitive to ordering than smaller ones. The GPT-4 technical report notes substantially improved calibration and instruction-following robustness compared to earlier models. However, recency bias is not zero on any model; the ordering guidelines above are safe defaults regardless of the model you are targeting. For model-specific prompt behavior, the Anthropic prompt engineering guide and the OpenAI prompt engineering guide both address model-family nuances worth reading before building production few-shot sections.
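To see how order-sensitive your own model and task are, one option is to score every ordering of a small exemplar set and compare the extremes. The sketch below assumes you supply `score`, a hypothetical callable that builds the prompt for a given ordering, runs it against your test inputs, and returns an accuracy; nothing here calls a real model API. Exhaustive permutation is only practical for small sets (4 exemplars give 24 orderings), so sample orderings for anything larger.

```python
from itertools import permutations
from typing import Callable, Sequence


def order_sensitivity(
    exemplars: Sequence[str],
    score: Callable[[tuple[str, ...]], float],
) -> tuple[float, float, tuple[str, ...]]:
    """Score every ordering; return (worst score, best score, best ordering)."""
    if not exemplars:
        raise ValueError("need at least one exemplar")
    results = [(score(order), order) for order in permutations(exemplars)]
    worst = min(s for s, _ in results)
    best_score, best_order = max(results, key=lambda r: r[0])
    return worst, best_score, best_order
```

A small gap between the worst and best scores suggests ordering is a minor lever for that model; a large gap means the ordering defaults above deserve a real test pass.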

Bad prompt vs RAILS-built few-shot prompt: a worked comparison

The two mock sessions below show the same task: extracting a one-sentence verdict from a product review. The first prompt is how most people approach it. The second applies the full RAILS few-shot logic: coverage variance, sanity baseline in the middle, strongest exemplar last, schema contract explicit.

few-shot-bad.txt

NO RAILS

Extract a one-sentence verdict from this product review.
Example 1: "The keyboard feels incredible. Every keystroke is satisfying and the build quality is exceptional. Absolutely worth the premium price."
Verdict: "A premium keyboard with exceptional build quality and satisfying key feel."
Example 2: "Stunning RGB, fantastic switches, love the wrist rest. This is the best keyboard I have ever owned and I have owned dozens."
Verdict: "The best keyboard in the reviewer's experience, with standout RGB and premium switches."
Example 3: "Unbeatable tactile response, gorgeous aluminum chassis, best in class software suite, incredible value even at this price point."
Verdict: "An unbeatable tactile keyboard with best-in-class software and strong value despite its price."
Now extract a verdict from: "It types okay. I use it every day and have no complaints."

Verdict: "A dependable daily-driver keyboard with a satisfying, trustworthy typing experience that consistently delivers for everyday use."

Problem: All three exemplars are ceiling-level praise. The model has no signal for restraint. A plain review about an ordinary keyboard gets a verdict that is far more enthusiastic than the input warrants. The output is not wrong; it is miscalibrated.

few-shot-rails.txt

RAILS

Extract a one-sentence verdict from a product review.
Output format: one sentence, starts with "Verdict:", neutral to positive register, no superlatives unless the review uses them.
Example 1 [edge case, short input]: "It types okay. I use it every day and have no complaints."
Verdict: "A reliable daily keyboard with no notable weaknesses."
Example 2 [sanity baseline, mid-range input]: "Good keyboard for the price. Clicky switches and decent build quality. Nothing special but it does the job well."
Verdict: "A solid budget keyboard with clicky switches that covers the basics without distinction."
Example 3 [complex, enthusiast input]: "The keyboard feels incredible. Every keystroke is satisfying and the build quality is exceptional. The software suite is best in class and the aluminum chassis is gorgeous. Absolutely worth the premium price."
Verdict: "A premium enthusiast keyboard with exceptional build quality, best-in-class software, and a price justified by its performance."
Now extract a verdict from: "It types okay. I use it every day and have no complaints."

Verdict: "A reliable daily keyboard with no notable weaknesses."

Result: The Sanity-Baseline Rule works. The plain exemplar (Example 2) and the edge-case exemplar (Example 1) together anchor the floor. The output matches the register of the input, not the ceiling of the exemplar set. One caveat: the actual input was identical to Example 1, so the model is reproducing a demonstrated output rather than generalizing. For a fair comparison, run both prompts on a plain review that appears in neither exemplar set.

Three anti-patterns that break few-shot sets in production

Most few-shot failures in production are not caused by picking the wrong number of examples. They are caused by one of three structural errors.

Anti-pattern 1: All exemplars from the same distribution. If every example in your set was drawn from the same narrow context (your best work, your personal favorites, your most recent projects), the model learns that distribution. It will handle inputs from that distribution well and generalize poorly to everything else. Vary across input length, topic, complexity, and edge-case type.

Anti-pattern 2: Stale exemplars after a schema change. When you change the output format, update every exemplar in the few-shot section before the next run. A prompt that instructs "output JSON with keys verdict, issues, rewrite" but shows examples outputting a three-bullet list creates a schema conflict. The model tends to resolve this by following one source or blending the two, which can produce output that satisfies neither specification. Treat the few-shot section as code: it needs to be versioned alongside the system prompt.

Anti-pattern 3: Exemplars without the reasoning trace on reasoning tasks. On tasks requiring multi-step inference (scoring, root-cause analysis, decision logic), an exemplar that shows only the final answer does not teach the model to reason. It teaches the model to pattern-match the answer format. The result is output that looks right on easy cases and fails silently on hard cases where the reasoning was actually needed. If the task requires reasoning, every exemplar must show the reasoning steps. This is the practical lesson Wei et al. (2022) demonstrated at scale: the trace is the exemplar, not the conclusion.
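Of the three, the stale-exemplar problem in anti-pattern 2 is the easiest to catch automatically. The sketch below assumes the current spec is a flat JSON object with the keys named in that example (verdict, issues, rewrite) and that exemplar outputs are stored as raw strings; `stale_exemplars` is a hypothetical helper you might run in CI before shipping a prompt change.

```python
import json


def stale_exemplars(outputs: list[str], required_keys: set[str]) -> list[int]:
    """Return indexes of exemplar outputs that do not match the current JSON schema."""
    stale = []
    for i, raw in enumerate(outputs):
        try:
            parsed = json.loads(raw)
        except json.JSONDecodeError:
            stale.append(i)      # not JSON at all, e.g. a leftover bullet list
            continue
        if not isinstance(parsed, dict) or set(parsed) != required_keys:
            stale.append(i)      # JSON, but the keys drifted from the spec
    return stale
```

An empty result means every exemplar output parses as JSON and carries exactly the required keys; it does not check value types, which a fuller validator would.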

This article is part of our complete prompt engineering guide, which covers the full RAILS framework: Role, Architecture, Instructions, Loop, and Safety. The few-shot exemplar section lives inside the Architecture layer of a RAILS prompt, alongside output schema and variable slots. If you find yourself running the same few-shot prompt structure across multiple use cases, the next step is to parameterize the exemplar slots and promote the prompt to a version-pinned, testable unit. That promotion is exactly what we built BrainBoot to productize: it is our sister tool for teams who have outgrown ad-hoc prompts and need a structured prompt OS. Worth looking at once you are running the same prompt 3 or more times in production.

Frequently asked questions

What is few-shot prompting?

Few-shot prompting is a technique where you include a small number of worked input-output examples inside the prompt itself, before your actual query, so the model can infer the desired format, tone, and logic from those examples rather than from abstract instructions alone. The term comes from the machine-learning concept of learning from very few labeled samples. In practice, 2 to 5 examples placed before the real request substantially improve output consistency on structured tasks.

How many examples should a few-shot prompt include?

Research by Wei et al. (2022) and Brown et al. (2020) consistently shows diminishing returns beyond 4 to 6 examples on most tasks. 2 to 3 examples are sufficient for format and tone anchoring. 4 to 6 examples help on tasks requiring multi-step reasoning. Beyond 8 examples, context-window cost rises faster than quality improves, and the model can begin overweighting the examples at the expense of the actual instruction. The exception is a classification task with many possible labels, where one example per label class is the right heuristic.

What is the Sanity-Baseline Rule in few-shot prompting?

The Sanity-Baseline Rule is the practice of deliberately including one plain, unremarkable exemplar among your N examples, alongside the high-quality ones. Its job is not to inspire the model; it is to anchor the floor. When every exemplar is polished and exceptional, the model can drift toward writing that is stylistically dense or overwrought, because it has only seen the ceiling. Adding one intentionally simple, workmanlike example teaches the model that ordinary clarity is acceptable and prevents systematic overreach. It is a calibration signal, not a quality compromise.

Does the order of few-shot examples matter?

Yes, and this is well-documented. Lu et al. (2022) showed that the ordering of in-context examples can shift accuracy by a substantial margin on classification benchmarks. One commonly cited mechanism is recency bias: models tend to anchor more heavily on the last example in a set. The practical implication: place your strongest, most representative exemplar last, and put the sanity-baseline exemplar in the middle of the set, not at the end where it would have outsized influence on the output register.

What is the difference between zero-shot, one-shot, and few-shot prompting?

Zero-shot prompting provides no examples at all: only instructions or a question. One-shot prompting includes exactly one input-output demonstration. Few-shot prompting includes 2 or more demonstrations, typically 2 to 8. The Brown et al. (2020) GPT-3 paper popularized this taxonomy. In practice, the boundary between one-shot and few-shot is informal; the important distinction is zero examples versus any examples, because even a single demonstration substantially changes the output pattern on structured tasks.

Bottom line

Few-shot prompting is not about filling in as many examples as the context window allows. The optimal count is 4 to 6 for structured reasoning tasks, 2 to 3 for format anchoring, and one per label class for classification. The non-obvious move is the Sanity-Baseline Rule: include one deliberately plain exemplar in the middle of the set to calibrate the output floor and prevent the model from treating your ceiling examples as the only valid register. Order the rest of the set from least to most complex, with your strongest exemplar last. Synchronize exemplars to the current output schema whenever the spec changes. And on tasks that require reasoning, every exemplar must show the reasoning trace, not just the conclusion. These five principles, combined, produce few-shot sections that stay reliable as the input distribution shifts in production.