Updated June 2026 · 11 min read · Part of the RAILS prompt engineering series

How to test and evaluate your prompts: the Self-Critique Loop (2026)

A prompt that produces good output once is luck. A prompt that reliably produces good output is one that was tested. Most people skip testing entirely and treat a prompt as something you run once and hope works. This article teaches the single highest-leverage evaluation technique in prompt engineering: appending a scoring rubric directly inside the prompt and instructing the model to revise its own output until it clears the bar. We call this the Self-Critique Rubric Pattern. It requires no external tooling, no eval framework, and no second model. It works on every major model family. It is, by a wide margin, the most underused move in the field. This article is part of our complete prompt engineering guide.

Last reviewed: June 2026 · Next review: December 2026
Table of contents
  1. What prompt evaluation actually means
  2. The Self-Critique Rubric Pattern defined
  3. The three scoring dimensions
  4. Worked example: before and after
  5. Anatomy of a testable prompt
  6. Why ban lists outperform positive-only instructions
  7. How to set the right threshold
  8. How this differs from academic self-refinement
  9. Does this work the same on every model?
  10. When to promote a tested prompt
  11. FAQ
  12. Bottom line

What does it actually mean to evaluate a prompt?

Most people evaluate prompts through a single pass of gut feel: they run the prompt, read the output, decide it is good enough or not, and either ship it or tweak it by hand. The problem is that this process does not scale. A prompt that felt good last week, on one input, may perform badly on a different input today. Gut feel is not a repeatable measure.

Prompt evaluation, done rigorously, means at least three things. First, running the prompt against a representative spread of inputs, not just the happy-path case you designed it for. Second, scoring the outputs against defined, measurable criteria rather than vibes. Third, knowing in advance what score constitutes "passing" so you are not renegotiating the bar every time you see output you like.
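
Those three requirements can be sketched in a few lines of Python. This is a minimal sketch, not a harness: the input names and scores are invented for illustration, and `PASS_THRESHOLD`, `passes`, and `pass_rate` are hypothetical names rather than part of any eval library.

```python
# Hypothetical scored runs: one dict of rubric scores (1-5) per test input.
# Input names and numbers are invented for illustration.
PASS_THRESHOLD = 4  # fixed before looking at any output

scored_runs = {
    "happy_path": {"slop": 5, "examples": 4, "clarity": 5},
    "very_short_input": {"slop": 4, "examples": 2, "clarity": 4},
    "off_topic_input": {"slop": 3, "examples": 3, "clarity": 2},
    "long_input": {"slop": 4, "examples": 4, "clarity": 4},
}


def passes(scores: dict[str, int], threshold: int = PASS_THRESHOLD) -> bool:
    """An output passes only if every dimension meets the threshold."""
    return all(score >= threshold for score in scores.values())


def pass_rate(runs: dict[str, dict[str, int]]) -> float:
    """Fraction of test inputs whose output cleared the bar."""
    return sum(passes(scores) for scores in runs.values()) / len(runs)


failing_inputs = [name for name, scores in scored_runs.items() if not passes(scores)]
print(f"pass rate: {pass_rate(scored_runs):.0%}")  # pass rate: 50%
print("failing inputs:", failing_inputs)
```

The useful part is the list of failing inputs: it shows which kinds of input the prompt was never designed for.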

External eval tooling, such as the OpenAI Evals toolkit or the practices described in Anthropic's model evaluation documentation, is the right answer for production systems. But for the overwhelming majority of prompt practitioners, adding the Self-Critique Rubric Pattern yields a far larger marginal gain than deploying a full eval harness. You can implement the pattern in two minutes. The harness takes two days.

What is the Self-Critique Rubric Pattern?

The Self-Critique Rubric Pattern is a prompt construction technique in which you embed a scoring rubric directly inside the prompt, specify a minimum passing score, and include a conditional instruction that tells the model to revise its output and re-score if the initial score falls below the threshold. The entire evaluation loop happens in a single inference call. You get back the output, the score, and a brief explanation of where it fell short.

The mechanism relies on a real asymmetry in how large language models work: a model's default generation is often weaker than its ability to evaluate. A model that would not spontaneously produce a well-structured argument can often identify exactly what is wrong with a poorly structured argument when asked. The rubric asks it to apply that better judgment to its own draft.

This is not a new idea in principle. What makes it underused is that practitioners think of evaluation as something that happens after the prompt runs, in a separate system. Moving evaluation inside the prompt removes that friction entirely.

The Self-Critique Rubric Pattern: minimal template
# ROLE
You are a senior technical writer with deep knowledge of [domain].

# TASK
[Your task instruction here. Be specific about what you want produced.]

# OUTPUT FORMAT
[Specify exact structure: sections, length, schema, whatever applies.]

# FORBIDDEN PATTERNS
Do not use: "it is worth noting", "in conclusion", "delve", "tapestry",
em-dashes, vague assertions without examples, claims without evidence type.

# SELF-CRITIQUE RUBRIC
After producing your draft, score it on these three dimensions (1-5 each):

  SLOP DENSITY   5 = zero filler sentences, every sentence carries a claim.
                 3 = fewer than 10% filler. 1 = more than 25% filler.

  EXAMPLE DENSITY 5 = every abstract claim has a concrete example.
                 3 = most claims have examples. 1 = mostly assertions.

  ARGUMENT CLARITY 5 = central claim statable in one sentence, supported.
                 3 = central claim findable but buried. 1 = no clear claim.

If any dimension scores below 4, identify the weak sentences and revise
the full output before returning. Return the final output, then the scores,
then a one-sentence note on any remaining gap.

That is the complete pattern. Everything else in this article is explanation of why each part exists and how to calibrate it.
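
If you want to log the scores the template returns, a small parser is enough. The sketch below assumes the model writes each score on a line shaped like `SLOP DENSITY: 4` or `SLOP DENSITY: 4/5`. The template does not enforce that shape, so treat the format, the `parse_scores` name, and the sample response as assumptions.

```python
import re

DIMENSIONS = ("SLOP DENSITY", "EXAMPLE DENSITY", "ARGUMENT CLARITY")


def parse_scores(response: str) -> dict[str, int]:
    """Extract 'DIMENSION: n' or 'DIMENSION: n/5' scores from a response.

    Assumes the prompt asked for scores in that shape. A missing dimension
    raises instead of guessing a value.
    """
    scores = {}
    for name in DIMENSIONS:
        match = re.search(rf"{re.escape(name)}\s*[:=]\s*([1-5])\b", response, re.IGNORECASE)
        if match is None:
            raise ValueError(f"no score found for {name!r}")
        scores[name] = int(match.group(1))
    return scores


# Invented response tail, for illustration only.
sample = """Final output text goes here.

SLOP DENSITY: 5/5
EXAMPLE DENSITY: 4/5
ARGUMENT CLARITY: 5
Note: one claim still lacks a benchmark.
"""
print(parse_scores(sample))
```

If parsing fails often, add one line to the OUTPUT FORMAT section that pins the score format rather than loosening the parser.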

What are the three scoring dimensions and why these three?

The three dimensions in the template above were chosen because they map to the most common failure modes in model-generated prose, they are specific enough for a model to apply consistently, and they are independent of domain. A rubric with ten dimensions is not ten times better than one with three; it is worse, because the model will score mechanically rather than reading carefully.

| Dimension | What it measures | Score 5 | Score 3 | Score 1 |
| --- | --- | --- | --- | --- |
| Slop density | The proportion of sentences that contain no unique information, claim, or example. Hedge language, restatements, and filler qualify as slop. | Zero filler. Every sentence earns its place. | Under 10% filler. Readable but padded. | Over 25% filler. Majority of sentences are restatements. |
| Example density | The ratio of concrete, specific examples to abstract assertions. "AI tools can improve productivity" has zero example density. "Using a parameterized prompt cut the revision cycle from four rounds to one for our technical content team" has high example density. | Every abstract claim accompanied by a specific example. | Most claims have examples. A few assertions left hanging. | Mostly assertions. Reader must take claims on faith. |
| Argument clarity | Whether the central claim is statable in one sentence and whether the piece is organized to support it. A piece with high argument clarity has a thesis; a piece with low argument clarity is a list of loosely related observations. | Central claim statable in one sentence. Structure visibly supports it. | Central claim findable but buried. Some supporting structure. | No identifiable central claim. Observations not organized toward a conclusion. |

Worked scored example: Self-Critique Rubric in action

Sample output under review

Task: explain why RAG (retrieval-augmented generation) reduces hallucination for a non-technical audience. Illustrative output, presented as it might arrive before self-critique runs.

Draft output
RAG is a really interesting and powerful approach to reducing hallucinations in AI systems. It is worth noting that hallucinations are a significant problem in generative AI. RAG works by giving the model access to external information, which can be very helpful. This means the model can look things up rather than relying solely on what it has memorized during training. RAG can improve the accuracy of AI responses in many different contexts. In conclusion, if you are concerned about hallucinations, RAG is definitely something to consider exploring for your use case.

Rubric scores applied to this draft

| Dimension | Score | Evidence from the draft | Verdict |
| --- | --- | --- | --- |
| Slop density | 2 / 5 | "It is worth noting" (banned filler). "very helpful", "really interesting", "definitely something to consider" carry no information. Four of the six sentences make no falsifiable claim. "In conclusion" restatement adds nothing. | Fail |
| Example density | 1 / 5 | Zero concrete examples anywhere. "External information" is undefined. No named scenario (e.g. a customer-support bot querying a knowledge base), no numbers, no comparisons. The reader cannot picture what RAG actually does. | Fail |
| Argument clarity | 3 / 5 | A central claim exists (RAG reduces hallucination by consulting external sources) but it is spread across three separate sentences, buried under filler, and never stated in one clean sentence. A reader skimming would not extract it reliably. | Borderline |
| Total | 6 / 15 | All three dimensions are below the threshold of 4, and slop density and example density fail outright. Revise those two first; argument clarity cannot be assessed cleanly until the filler is gone. | REVISE |

What the model should do next (the revise clause fires)
The model targets the two failing dimensions first: strip the filler sentences, name a concrete example (e.g. "a legal research tool that queries 10,000 case documents in real time"), replace "gives the model access to external information" with a specific mechanism sentence, and consolidate the central claim into the opening sentence. Then it re-scores. If all three dimensions reach 4 or higher, it returns the revised output. If any dimension is still below 4, it revises once more, up to the maximum revision count if the prompt sets one.
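
The revise clause is a small decision rule, and writing it out makes the edge cases visible. The sketch below applies it to the draft scores above (2, 1, 3) with a threshold of 4. The function names and the default cap of two revisions are assumptions for illustration; inside the prompt, the model performs this logic itself.

```python
def failing_dimensions(scores: dict[str, int], threshold: int = 4) -> list[str]:
    """Return the dimensions that score below the threshold, lowest first."""
    below = [name for name, score in scores.items() if score < threshold]
    return sorted(below, key=lambda name: scores[name])


def next_action(scores: dict[str, int], revisions_done: int,
                threshold: int = 4, max_revisions: int = 2) -> str:
    """Decide whether to return the draft or revise it again."""
    if not failing_dimensions(scores, threshold):
        return "return"
    if revisions_done >= max_revisions:
        return "return_with_gap_note"  # out of budget: report the remaining gap
    return "revise"


draft_scores = {"slop": 2, "examples": 1, "clarity": 3}
print(sum(draft_scores.values()))                   # 6 (out of 15)
print(failing_dimensions(draft_scores))             # ['examples', 'slop', 'clarity']
print(next_action(draft_scores, revisions_done=0))  # revise
```

Sorting the failing dimensions lowest first mirrors the advice to fix the worst problem before re-scoring the rest.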

For domain-specific work, you can add a fourth dimension without breaking the pattern. For code outputs, add "edge case coverage" (does the code handle empty input, type mismatches, and the documented boundary conditions?). For research summaries, add "evidence type specificity" (does the output name the type of evidence it cites rather than asserting claims as facts?). Keep the total at four or fewer.

Worked example: a prompt before and after the pattern

The clearest way to see the pattern's effect is to compare outputs from the same task with and without it. The task below is a common one: summarizing the tradeoffs of a technical decision for a non-technical audience.

Without self-critique rubric
Prompt (no rubric)
Explain the tradeoffs of using a vector database versus a traditional relational database for semantic search. Write for a non-technical product manager.
Output

When deciding between a vector database and a traditional relational database for semantic search, there are several important tradeoffs to consider. Both approaches have their advantages and disadvantages, and the right choice depends on your specific needs and requirements.

Vector databases are designed to handle high-dimensional data and are optimized for similarity search. They can be very powerful for use cases like semantic search. However, they may require more technical expertise to set up and maintain.

Traditional relational databases are well-established and widely understood. They are generally easier to manage for teams that already have SQL expertise. However, they are not natively optimized for semantic similarity queries, which can lead to performance issues at scale.

In conclusion, the best choice depends on your team's expertise, your performance requirements, and the specific nature of your search use case.

Rubric check (illustrative): Slop: 2/5 (four of nine sentences contain no claim). Example density: 1/5 (zero concrete examples). Argument clarity: 2/5 (ends with "it depends" rather than a position). Would not clear a threshold of 4/4/4.
With self-critique rubric appended
Prompt (with rubric, same task)
Explain the tradeoffs of using a vector database versus a traditional relational database for semantic search. Write for a non-technical product manager.

FORBIDDEN: "it depends", "in conclusion", vague adjectives like "powerful" without a number.

SELF-CRITIQUE RUBRIC (score 1-5 each, revise and re-score if any dimension is below 4):
- SLOP DENSITY: 5 = every sentence carries a claim. 1 = majority are restatements.
- EXAMPLE DENSITY: 5 = every abstract claim has a specific example. 1 = mostly assertions.
- ARGUMENT CLARITY: 5 = central claim statable in one sentence. 1 = no central claim.
Output (after self-revision)

For semantic search on a catalog of more than roughly 100,000 items, a vector database outperforms a relational database on query speed by an order of magnitude. The reason is structural: relational databases compare text rows using exact or pattern-match logic (SQL LIKE or full-text indexes), while vector databases compare high-dimensional numerical representations of meaning, so a query for "lightweight rain jacket" also surfaces "packable windbreaker" without any keyword overlap.

The cost of that capability is operational complexity. Pinecone and Weaviate, two common choices, require maintaining a separate data pipeline to generate and index the embeddings whenever your catalog changes. A team already running PostgreSQL can add the pgvector extension and stay within a familiar system, accepting a query-speed tradeoff that only becomes a problem above roughly half a million items or under 200-millisecond latency requirements.

The clearest decision rule: if your PM can define "semantic" success with a list of five query-to-result pairs and those pairs require understanding synonyms or intent rather than keywords, benchmark pgvector first. Upgrade to a dedicated vector store if query latency exceeds your product's requirement on your actual catalog size, not before.

Rubric scores (illustrative): Slop: 5/5. Example density: 4/5 (concrete tools and scale thresholds cited; one claim about "familiar system" could use a benchmark). Argument clarity: 5/5 (central claim: use pgvector first, upgrade when latency exceeds requirement). Clears threshold.

The second output is not just better formatted. It makes specific claims (100,000 items, 200-millisecond latency, Pinecone and Weaviate), gives the product manager a concrete decision rule, and arrives at a position rather than a hedge. The rubric forced the model to notice it had not done those things in its first draft and fix them before returning output.

What does the anatomy of a testable prompt look like?

The Self-Critique Rubric Pattern is most effective when the prompt around it is also well-structured. A rubric appended to a vague, unstructured prompt will raise the floor but cannot fully compensate for the prompt's other weaknesses. The four layers of a testable prompt build on each other.

Layer 1: Instruction (ROLE + TASK)
The task, stated with a named role and a specific competence. Not "write a summary" but "you are a senior technical writer with five years covering distributed systems: summarize the following for a CTO who already knows what a database is."

Layer 2: Context ({{context}} {{voice}} {{schema}})
Variable input slots that make the prompt reusable across inputs. Use double-brace slots: {{source_text}}, {{target_audience}}, {{word_limit}}. Decide which inputs are parameterized and which are hardcoded to the task definition.

Layer 3: Format + ban list (OUTPUT FORMAT + FORBIDDEN)
An explicit output schema (sections, keys, word counts, table structure) and a list of forbidden patterns. Both are binary and checkable, which is why they are more reliable than purely positive instructions.

Layer 4: Rubric + guardrail (RUBRIC + GUARDRAIL)
The self-critique rubric (score, revise if below threshold) plus a refusal clause: "If the input is missing critical information or would require fabricating specific numbers, state what information is needed instead of proceeding." This is the safety layer.
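
The double-brace slots in Layer 2 only help if a missing value fails loudly instead of shipping a prompt with `{{word_limit}}` still in it. A minimal sketch, assuming slot names are plain identifiers; `fill_slots` is a hypothetical helper, not a library function, and the template text is invented for illustration.

```python
import re

SLOT = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def fill_slots(template: str, values: dict[str, str]) -> str:
    """Replace {{name}} slots; raise if any slot has no value."""
    missing = sorted(set(SLOT.findall(template)) - set(values))
    if missing:
        raise KeyError(f"missing values for slots: {missing}")
    return SLOT.sub(lambda m: values[m.group(1)], template)


template = (
    "Summarize the following for {{target_audience}} "
    "in at most {{word_limit}} words:\n{{source_text}}"
)
prompt = fill_slots(template, {
    "target_audience": "a CTO",
    "word_limit": "150",
    "source_text": "Release notes text goes here.",
})
print(prompt)
```

Extra keys in `values` are ignored on purpose, so one values dict can feed several templates.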

This four-layer structure maps to a broader principle in the RAILS framework (the five-rule structure we cover in our complete prompt engineering guide): Role, Architecture, Instructions, Loop, Safety. The Self-Critique Rubric Pattern is the Loop component: it closes the feedback path so the model corrects its own output rather than waiting for you to notice the error and re-run. For a deeper look at the instruction layer specifically, see our guide on how to write a system prompt.

Why do ban lists outperform positive-only instructions?

Positive instructions set a direction; ban lists close the floor. Telling the model to "be specific" leaves the definition of specificity to the model's defaults, which vary. Telling it not to make a claim without citing the evidence type is binary: either the claim cites an evidence type or it does not. The model can self-check a ban list. It cannot easily self-check an abstract instruction to "be better."

The practical comparison: a prompt with the instruction "write concisely" will routinely produce outputs with unnecessary preamble because the model's training corpus contains enormous amounts of preamble and "write concisely" does not specifically forbid any of it. The same prompt with "do not begin with a sentence that restates the question" removes the most common form of preamble immediately and completely.

Build your ban list from your own output history. Run your prompt ten times across different inputs. Every phrase that appears in outputs you reject gets added to the ban list. The list is personalized to the failure patterns of your specific prompt, which makes it much more effective than a generic list of AI cliches. Start with five to eight items and grow it empirically.
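
Mining your output history can be partly automated. The sketch below counts three-word phrases that recur across rejected outputs and never appear in accepted ones. It is a rough heuristic under stated assumptions: the sample outputs are invented, `ban_candidates` is a hypothetical helper, and the result is a candidate list for you to review, not a finished ban list.

```python
from collections import Counter
import re


def ngrams(text: str, n: int = 3) -> set[str]:
    """Lowercased word n-grams that occur in one output."""
    words = re.findall(r"[a-z']+", text.lower())
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def ban_candidates(rejected: list[str], accepted: list[str],
                   n: int = 3, min_outputs: int = 2) -> list[str]:
    """Phrases seen in several rejected outputs and in no accepted output."""
    # Each output contributes a phrase at most once, so counts are per output.
    counts = Counter(gram for text in rejected for gram in ngrams(text, n))
    seen_in_accepted = set().union(*(ngrams(text, n) for text in accepted))
    return sorted(
        gram for gram, count in counts.items()
        if count >= min_outputs and gram not in seen_in_accepted
    )


# Invented outputs, for illustration only.
rejected = [
    "It is worth noting that caching helps.",
    "It is worth noting that indexes matter.",
]
accepted = ["Caching cut p95 latency from 900 ms to 120 ms."]
print(ban_candidates(rejected, accepted))
# ['is worth noting', 'it is worth', 'worth noting that']
```

Overlapping candidates like these usually collapse into one human-readable ban ("it is worth noting"), which is why the final list should stay hand-edited.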

A few ban-list items are close to universal for prose prompts. The phrase "it is worth noting" is filler in every context because it provides no information about what follows. "In conclusion" is almost always a cue for a restatement that adds nothing. "Delve" and "tapestry" are words many readers now treat as tells of machine-generated text. None of these carries information, and in our experience all of them can be banned without any cost to output quality.

For code prompts, the ban list looks different: no SELECT *, no bare except: pass, no magic numbers without a named constant. The principle is the same: ban the patterns that signal shortcuts or low-quality defaults.
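
Because a ban is binary, you can also check it outside the model. The sketch below runs a prose ban list and a code ban list over an output with regular expressions. The pattern sets are hypothetical starting points, and magic-number detection is left out because a regex cannot do it reliably.

```python
import re

# Hypothetical ban lists; replace with patterns from your own rejected outputs.
PROSE_BANS = {
    "it is worth noting": re.compile(r"\bit is worth noting\b", re.IGNORECASE),
    "in conclusion": re.compile(r"\bin conclusion\b", re.IGNORECASE),
}
CODE_BANS = {
    "SELECT *": re.compile(r"\bSELECT\s+\*", re.IGNORECASE),
    "bare except: pass": re.compile(r"except\s*:\s*pass\b"),
}


def violations(text: str, bans: dict[str, re.Pattern]) -> list[str]:
    """Names of banned patterns found in the text, in ban-list order."""
    return [name for name, pattern in bans.items() if pattern.search(text)]


snippet = "try:\n    rows = db.execute('select * from users')\nexcept:\n    pass\n"
print(violations(snippet, CODE_BANS))                        # ['SELECT *', 'bare except: pass']
print(violations("In conclusion, it depends.", PROSE_BANS))  # ['in conclusion']
```

A check like this makes a useful second line of defense on smaller models, where the in-prompt self-check is less dependable.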

How do you set the right revision threshold?

The threshold is the minimum score each dimension must reach; any dimension that scores below it triggers a revision. Setting it to 4 out of 5 on all three dimensions is a reasonable starting point for most use cases. It is high enough to filter genuinely poor outputs and low enough that the model does not keep revising a task that is inherently ambiguous.

Do not set all dimensions to 5/5 by default. A threshold of 5 on example density will cause the model to pad outputs with examples to satisfy the rubric, often at the cost of argument clarity. The rubric is a minimum floor, not an optimization target. Set it at the point where outputs below that score are actually bad for your use case, not at the highest theoretically achievable score.

One useful calibration exercise: score ten of your existing outputs manually against your rubric. Find the score below which you have consistently rejected outputs in the past. Set your threshold there. If you have never rejected an output that scored 3/5 on slop density, your actual threshold for that dimension is 3 or lower, not 4, and an overly high threshold will waste model inference.
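
The calibration exercise reduces to one question per dimension: what is the lowest score you have actually accepted? A minimal sketch, assuming you keep (score, accepted) pairs from manual review; the history below is invented and `calibrated_threshold` is a hypothetical name.

```python
def calibrated_threshold(history: list[tuple[int, bool]]) -> int:
    """Lowest score you have accepted on one dimension.

    history holds (score, accepted) pairs from manual review. Every output
    scored below the returned value was rejected, so revise below it.
    """
    accepted_scores = [score for score, accepted in history if accepted]
    if not accepted_scores:
        raise ValueError("no accepted outputs to calibrate against")
    return min(accepted_scores)


# Ten invented manual reviews of slop density: (score, accepted?)
slop_history = [
    (5, True), (4, True), (3, True), (3, False), (2, False),
    (2, False), (4, True), (1, False), (5, True), (3, True),
]
print(calibrated_threshold(slop_history))  # 3
```

Taking the minimum is sensitive to a single lenient review, so with only ten samples it is worth rereading the lowest accepted output before committing to the number.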

For longer or more complex tasks, you can add a maximum revision count: "revise at most twice." This prevents the model from entering a long revision loop on a task where the rubric criteria are in genuine tension with each other (high example density can reduce argument clarity if the examples overwhelm the structure).

How does this pattern differ from academic self-refinement research?

The Self-Critique Rubric Pattern shares surface structure with the self-refinement paradigm studied in AI research, but the two serve different goals and make different commitments. Madaan et al. (2023), "Self-Refine: Iterative Refinement with Self-Feedback", demonstrated that models can improve outputs through iterative self-generated feedback across tasks including math reasoning, code generation, and dialogue. Their setup uses separate generate, feedback, and refine steps, often across multiple rounds, with the model supplying natural-language critique as an intermediate artifact.

Three differences matter for practitioners.

First, self-refinement as studied by Madaan et al. is a multi-call pipeline: generate, then critique, then revise, with the feedback step producing an intermediate string that is fed back into a new inference call. The Self-Critique Rubric Pattern collapses this into a single call by embedding the evaluation criteria up front. You trade iteration depth for operational simplicity.

Second, self-refinement setups tend to express quality criteria as free-form natural-language feedback and lean on the model's prior knowledge of what "good" means for the task. The rubric pattern makes the criteria explicit and numeric, which is what makes the threshold meaningful and repeatable.

Third, much of the self-refinement literature measures improvement on tasks with checkable answers (math, code). The three-dimension rubric here targets prose quality, where correctness is not a useful frame. The slop density, example density, and argument clarity dimensions are designed for this space specifically, not derived from the broader literature. The ban-list-as-floor argument, in particular, has no direct analog that we know of in the self-refinement literature; it is a practitioner invention that closes a gap the academic framing leaves open.
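
The call-count difference is easiest to see side by side. The sketch below shows only the shape of each approach, with a stand-in `fake_model` so it runs offline. It is not the Self-Refine implementation: that method also includes a stopping condition, which this sketch omits, and every name and prompt string here is hypothetical.

```python
from typing import Callable

Model = Callable[[str], str]  # hypothetical: takes a prompt, returns text


def multi_call_refine(model: Model, task: str, rounds: int = 2) -> tuple[str, int]:
    """Generate once, then alternate critique and revise as separate calls."""
    calls = 1
    draft = model(f"TASK: {task}")
    for _ in range(rounds):
        feedback = model(f"CRITIQUE this draft:\n{draft}")
        draft = model(f"REVISE using feedback:\n{feedback}\n\nDRAFT:\n{draft}")
        calls += 2
    return draft, calls


def single_call_rubric(model: Model, task: str, rubric: str) -> tuple[str, int]:
    """Draft, score, and revise inside one call by embedding the rubric."""
    return model(f"TASK: {task}\n\n{rubric}"), 1


def fake_model(prompt: str) -> str:
    """Stand-in for a real API call so the sketch runs offline."""
    return f"[response to {len(prompt)} chars]"


_, multi_calls = multi_call_refine(fake_model, "Explain RAG", rounds=2)
_, single_calls = single_call_rubric(fake_model, "Explain RAG", "RUBRIC: score 1-5")
print(multi_calls, single_calls)  # 5 1
```

The multi-call version gives you the intermediate feedback as a separate artifact you can log or inspect; the single-call version gives that up in exchange for one request.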

Does the Self-Critique Rubric Pattern work the same on every model?

The pattern works on every major current model but with different calibration requirements. Frontier models (GPT-4 class, Claude 3 Sonnet and above, Gemini 1.5 Pro) follow the rubric instructions reliably and produce coherent self-critiques. Smaller or quantized models follow the instruction format but their self-critique accuracy is lower: they will sometimes score a weak output as 5/5 and miss obvious problems. The rubric is most effective when the model has enough instruction-following capability to reliably apply a three-criteria framework.

The OpenAI model documentation and the Anthropic model overview both note that instruction-following performance correlates with model size. For self-critique rubrics specifically, this matters: a model that is borderline on instruction-following will follow the rubric format syntactically but apply the criteria loosely. If you are running on a smaller model and the rubric does not seem to be catching the problems you expect, simplify the rubric to a single dimension first and verify it is being applied before adding more.
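
One way to verify that a rubric is being applied is to score a handful of outputs yourself and compare them with the model's self-scores on the same dimension. The sketch below assumes you have both lists in the same order; the numbers are invented, and `mean_abs_gap` and `inflated_share` are hypothetical helper names.

```python
def mean_abs_gap(model_scores: list[int], human_scores: list[int]) -> float:
    """Average absolute difference between self-scores and your own scores."""
    if len(model_scores) != len(human_scores) or not model_scores:
        raise ValueError("need two equal-length, non-empty score lists")
    gaps = [abs(m - h) for m, h in zip(model_scores, human_scores)]
    return sum(gaps) / len(gaps)


def inflated_share(model_scores: list[int], human_scores: list[int]) -> float:
    """Fraction of outputs the model scored higher than you did."""
    if len(model_scores) != len(human_scores) or not model_scores:
        raise ValueError("need two equal-length, non-empty score lists")
    higher = sum(m > h for m, h in zip(model_scores, human_scores))
    return higher / len(model_scores)


# Invented slop-density scores for five outputs.
model_self_scores = [5, 5, 4, 5, 4]
your_scores = [3, 4, 4, 2, 4]
print(mean_abs_gap(model_self_scores, your_scores))    # 1.2
print(inflated_share(model_self_scores, your_scores))  # 0.6
```

A large gap with a high inflated share is the pattern described above: the model follows the rubric format but applies the criteria loosely.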

On reasoning models (o-series from OpenAI, extended thinking from Anthropic), the internal chain-of-thought that precedes the visible output functions as a natural self-critique pass. You still benefit from appending the rubric, because it names the criteria explicitly rather than leaving them implicit, but the revision step is less frequently triggered because the model's generation quality is already higher.

When should a tested prompt be promoted?

The Self-Critique Rubric Pattern is the most important evaluation technique at the individual-prompt level. But at some point, running the same prompt repeatedly becomes an argument for a more durable structure. The useful signal is repetition: when you have run the same prompt, with minor variations, more than three times, the prompt has revealed what it needs to work reliably. That is the moment to promote it.

Promotion means four things: adding typed variable slots for every input that changed across runs, pinning a version number, writing down what the rubric threshold was and why, and naming the failure modes you observed before promotion. A promoted prompt is testable in a way a one-off prompt never is. You can run it against a consistent set of test inputs and track whether its scores degrade over time.
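
The four parts of promotion fit naturally into a small record type. This is a sketch of one possible shape, not a BrainBoot or RAILS schema: the class name, field names, and example values are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class PromotedPrompt:
    """Record of a promoted prompt; field names are illustrative."""
    name: str
    version: str
    template: str
    slots: dict[str, type]   # slot name -> expected Python type
    threshold: int           # minimum score per rubric dimension
    threshold_rationale: str
    failure_modes: list[str] = field(default_factory=list)

    def validate(self, values: dict[str, object]) -> None:
        """Raise if a slot is missing or has the wrong type."""
        for slot, expected in self.slots.items():
            if slot not in values:
                raise KeyError(f"missing slot: {slot}")
            if not isinstance(values[slot], expected):
                raise TypeError(f"{slot} must be {expected.__name__}")


summary_prompt = PromotedPrompt(
    name="cto_summary",
    version="1.0.0",
    template="Summarize {{source_text}} in at most {{word_limit}} words.",
    slots={"source_text": str, "word_limit": int},
    threshold=4,
    threshold_rationale="Outputs scoring 3 on slop density were rejected in review.",
    failure_modes=["opens by restating the question"],
)
summary_prompt.validate({"source_text": "Release notes text", "word_limit": 150})
print(summary_prompt.name, summary_prompt.version)  # cto_summary 1.0.0
```

Storing the rationale next to the threshold means a later reader can tell whether a change to the number was a deliberate decision or drift.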

A prompt you have tested and scored is worth keeping and versioning. We put rated, versioned prompts and brains in BrainBoot's marketplace (our own Prompt OS) for exactly that reason. The marketplace is where promoted prompts that cleared a rubric threshold live alongside their documented test inputs and scores, so you can reuse them without rebuilding from scratch each time.

Frequently asked questions

What is the Self-Critique Rubric Pattern in prompt engineering?
The Self-Critique Rubric Pattern is a technique where you append a scoring rubric directly inside your prompt and instruct the model to score its own output against that rubric before returning it to you. You set a minimum passing score, and the prompt instructs the model to revise and re-score if the initial output falls below that threshold. This transforms a static prompt into a self-improving loop without any external tooling.

How do you evaluate whether a prompt is working?
Evaluate a prompt on three axes: argument clarity (is the central claim stated plainly and supported?), example density (are abstract claims grounded in concrete, specific examples rather than vague generalizations?), and slop density (how many filler phrases, hedge words, and content-free sentences are present?). Score each axis on a 1-5 scale. A prompt that consistently produces outputs scoring 4 or above on all three axes across different inputs is working. A prompt that produces variable scores is under-specified and needs clearer instructions or a ban list.

What is the difference between a prompt and a brain in prompt engineering?
A prompt is a single string of text with no typed input slots, no output schema, no test suite, and no invariants. A brain is a self-contained cognitive unit that adds four components to the raw text: a system prompt, priority-ordered execution rules, a declared output format, and guardrails that define what the unit must never do. The dividing line is whether the unit is testable and parameterized. Once you add typed variable slots and a rubric you can run repeatably, the thing has graduated from a prompt to a brain.

Why do positive-only instructions fail in prompts?
Positive-only instructions tell the model what direction to walk but leave the floor undefined. Without an explicit ban list, the model fills ambiguity with its training defaults, which tend toward vague affirmations, hedge language, and the cliches that dominate its training corpus. A forbidden-pattern list closes the floor. Telling the model not to use phrases like "it is worth noting" or "in conclusion" is more reliably enforced than asking it to "be direct" because a ban is binary and checkable; a positive instruction is a gradient.

What scoring dimensions should a prompt rubric include?
The three highest-signal dimensions for general prompt rubrics are: slop density (the proportion of sentences that contain no unique information or claim), example density (the ratio of concrete specific examples to abstract assertions), and argument clarity (whether the central claim can be extracted in one sentence). For domain-specific prompts, add a fourth dimension relevant to the task: for code, edge-case handling; for research, evidence type specificity; for copy, specificity of benefit claims. Keep the total at four or fewer.

Bottom line

If you take one technique from this article, make it the Self-Critique Rubric Pattern. Append a rubric to your next prompt, set a threshold of 4 on each of the three dimensions, and observe what the model reports about its own draft before it revises. You will see, in plain text, exactly what it considered weak and what it changed. That diagnostic information alone is worth the two minutes it takes to add the rubric. From there, the progression is straightforward: identify your actual failure patterns, build a ban list from them, layer the four anatomy components around the rubric, and when you have run the prompt more than three times, promote it to a versioned, tested unit. That is the complete path from "hoping the output is good" to "knowing it is." For the full RAILS framework that this technique sits inside, see our complete prompt engineering guide.

  1. Wei et al. (2022). Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. arXiv:2201.11903.
  2. Madaan et al. (2023). Self-Refine: Iterative Refinement with Self-Feedback. arXiv:2303.17651.
  3. OpenAI Evals documentation (2024).
  4. Anthropic. Evaluating Claude: model evaluation documentation.
  5. Anthropic. Claude models overview (2026).
  6. OpenAI. Models and capabilities documentation (2026).
Disclosure: Nesyona is reader-supported. BrainBoot is our own Prompt OS product and is disclosed as such where linked. No vendor paid for placement in this article.