How to test and evaluate your prompts: the Self-Critique Loop (2026)
A prompt that produces good output once is luck. A prompt that reliably produces good output is one that was tested. Most people skip testing entirely: they run a prompt and hope for the best. This article teaches the single highest-leverage evaluation technique in prompt engineering: appending a scoring rubric directly inside the prompt and instructing the model to revise its own output until it clears the bar. We call this the Self-Critique Rubric Pattern. It requires no external tooling, no eval framework, and no second model. It works on every major model family. It is, by a wide margin, the most underused move in the field. This article is part of our complete prompt engineering guide.
- The move: append a scoring rubric and a "revise if below threshold" clause to any prompt. The model scores its draft, identifies where it fell short, and revises before returning output to you.
- Why it works: the model has good evaluative capability even when its generative defaults are mediocre. You are asking it to apply what it already knows about quality to its own work.
- The three scoring dimensions that matter most: slop density, example density, and argument clarity. Each scored 1 to 5 with explicit criteria.
- Positive instructions alone fail: they leave the floor undefined. A ban list plus a rubric closes it.
Table of contents
- What prompt evaluation actually means
- The Self-Critique Rubric Pattern defined
- The three scoring dimensions
- Worked example: before and after
- Anatomy of a testable prompt
- Why ban lists outperform positive-only instructions
- How to set the right threshold
- How this differs from academic self-refinement
- Does this work the same on every model?
- When to promote a tested prompt
- FAQ
- Bottom line
What does it actually mean to evaluate a prompt?
Most people evaluate prompts through a single pass of gut feel: they run the prompt, read the output, decide it is good enough or not, and either ship it or tweak it by hand. The problem is that this process does not scale. A prompt that felt good last week, on one input, may perform badly on a different input today. Gut feel is not a repeatable measure.
Prompt evaluation, done rigorously, means at least three things. First, running the prompt against a representative spread of inputs, not just the happy-path case you designed it for. Second, scoring the outputs against defined, measurable criteria rather than vibes. Third, knowing in advance what score constitutes "passing" so you are not renegotiating the bar every time you see output you like.
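The three requirements above can be sketched as a small pass-rate check. This is a minimal illustration, not a real eval harness: `pass_rate`, `fake_run_prompt`, and `fake_score_output` are hypothetical names, and the two stand-in functions replace what would be a model call and a rubric-based scorer in practice.

```python
from typing import Callable, Iterable


def pass_rate(
    run_prompt: Callable[[str], str],
    score_output: Callable[[str], int],
    inputs: Iterable[str],
    passing_score: int,
) -> float:
    """Fraction of test inputs whose output scores at or above passing_score."""
    results = [score_output(run_prompt(text)) >= passing_score for text in inputs]
    if not results:
        raise ValueError("need at least one test input")
    return sum(results) / len(results)


# Stand-ins for illustration only. A real run_prompt would call a model,
# and a real score_output would apply defined criteria to the output.
def fake_run_prompt(text: str) -> str:
    return text.upper()


def fake_score_output(output: str) -> int:
    return 5 if len(output) > 10 else 2


# A spread of inputs, not just the happy path; the bar is fixed in advance.
test_inputs = ["short", "a much longer happy-path input", "another long edge case"]
rate = pass_rate(fake_run_prompt, fake_score_output, test_inputs, passing_score=4)
print(f"pass rate: {rate:.2f}")
```

The point of the shape is that the inputs, the scorer, and the passing score are all arguments decided before any output is read.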
External eval frameworks like the OpenAI evals toolkit and the Anthropic model evaluation documentation are the right answer for production systems. But for the overwhelming majority of prompt practitioners, the marginal gain from the Self-Critique Rubric Pattern is far larger than the marginal gain from deploying a full eval harness. You can implement the pattern in two minutes. The harness takes two days.
What is the Self-Critique Rubric Pattern?
The Self-Critique Rubric Pattern is a prompt construction technique in which you embed a scoring rubric directly inside the prompt, specify a minimum passing score, and include a conditional instruction that tells the model to revise its output and re-score if the initial score falls below the threshold. The entire evaluation loop happens in a single inference call. You get back the output, the score, and a brief explanation of where it fell short.
The mechanism relies on a real asymmetry in how large language models work: generative defaults are worse than evaluative capability. A model that would not spontaneously produce a well-structured argument can often identify exactly what is wrong with a poorly structured one when asked. The rubric asks it to apply that better judgment to its own draft.
This is not a new idea in principle. What makes it underused is that practitioners think of evaluation as something that happens after the prompt runs, in a separate system. Moving evaluation inside the prompt removes that friction entirely.
    # ROLE
    You are a senior technical writer with deep knowledge of [domain].

    # TASK
    [Your task instruction here. Be specific about what you want produced.]

    # OUTPUT FORMAT
    [Specify exact structure: sections, length, schema, whatever applies.]

    # FORBIDDEN PATTERNS
    Do not use: "it is worth noting", "in conclusion", "delve", "tapestry", em-dashes, vague assertions without examples, claims without evidence type.

    # SELF-CRITIQUE RUBRIC
    After producing your draft, score it on these three dimensions (1-5 each):

    SLOP DENSITY
    5 = zero filler sentences, every sentence carries a claim.
    3 = fewer than 10% filler.
    1 = more than 25% filler.

    EXAMPLE DENSITY
    5 = every abstract claim has a concrete example.
    3 = most claims have examples.
    1 = mostly assertions.

    ARGUMENT CLARITY
    5 = central claim statable in one sentence, supported.
    3 = central claim findable but buried.
    1 = no clear claim.

    If any dimension scores below 4, identify the weak sentences and revise the full output before returning.
    Return the final output, then the scores, then a one-sentence note on any remaining gap.
That is the complete pattern. Everything else in this article is explanation of why each part exists and how to calibrate it.
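If you reuse the template across tasks, it can help to assemble it programmatically so the threshold and ban list are parameters rather than hand-edited text. The sketch below is one possible way to do that; `build_prompt` and `RUBRIC` are hypothetical names, and the wording is copied from the template above.

```python
# Rubric anchors copied from the template; the dict layout is an assumption.
RUBRIC = {
    "SLOP DENSITY": {
        5: "zero filler sentences, every sentence carries a claim.",
        3: "fewer than 10% filler.",
        1: "more than 25% filler.",
    },
    "EXAMPLE DENSITY": {
        5: "every abstract claim has a concrete example.",
        3: "most claims have examples.",
        1: "mostly assertions.",
    },
    "ARGUMENT CLARITY": {
        5: "central claim statable in one sentence, supported.",
        3: "central claim findable but buried.",
        1: "no clear claim.",
    },
}


def build_prompt(role, task, output_format, banned, rubric=RUBRIC, threshold=4):
    """Assemble role, task, format, ban list, and rubric into one prompt string."""
    lines = [
        "# ROLE", role, "",
        "# TASK", task, "",
        "# OUTPUT FORMAT", output_format, "",
        "# FORBIDDEN PATTERNS", "Do not use: " + ", ".join(banned) + ".", "",
        "# SELF-CRITIQUE RUBRIC",
        f"After producing your draft, score it on these {len(rubric)} dimensions (1-5 each):",
    ]
    for name, anchors in rubric.items():
        lines.append(name)
        # List anchors from best score to worst, as in the template.
        for score in sorted(anchors, reverse=True):
            lines.append(f"{score} = {anchors[score]}")
    lines += [
        "",
        f"If any dimension scores below {threshold}, identify the weak sentences "
        "and revise the full output before returning.",
        "Return the final output, then the scores, then a one-sentence note "
        "on any remaining gap.",
    ]
    return "\n".join(lines)


prompt = build_prompt(
    role="You are a senior technical writer with deep knowledge of databases.",
    task="Explain why RAG reduces hallucination for a non-technical audience.",
    output_format="Three short paragraphs, no headings.",
    banned=['"it is worth noting"', '"in conclusion"', '"delve"', '"tapestry"'],
)
print(prompt)
```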
What are the three scoring dimensions and why these three?
The three dimensions in the template above were chosen because they map to the most common failure modes in model-generated prose, they are specific enough for a model to apply consistently, and they are independent of domain. A rubric with ten dimensions is not ten times better than one with three; it is worse, because the model will score mechanically rather than reading carefully.
| Dimension | What it measures | Score 5 | Score 3 | Score 1 |
|---|---|---|---|---|
| Slop density | The proportion of sentences that contain no unique information, claim, or example. Hedge language, restatements, and filler qualify as slop. | Zero filler. Every sentence earns its place. | Under 10% filler. Readable but padded. | Over 25% filler. Majority of sentences are restatements. |
| Example density | The ratio of concrete, specific examples to abstract assertions. "AI tools can improve productivity" has zero example density. "Using a parameterized prompt cut the revision cycle from four rounds to one for our technical content team" has high example density. | Every abstract claim accompanied by a specific example. | Most claims have examples. A few assertions left hanging. | Mostly assertions. Reader must take claims on faith. |
| Argument clarity | Whether the central claim is statable in one sentence and whether the piece is organized to support it. A piece with high argument clarity has a thesis; a piece with low argument clarity is a list of loosely related observations. | Central claim statable in one sentence. Structure visibly supports it. | Central claim findable but buried. Some supporting structure. | No identifiable central claim. Observations not organized toward a conclusion. |
Sample output under review
Task: explain why RAG (retrieval-augmented generation) reduces hallucination for a non-technical audience. Illustrative output, presented as it might arrive before self-critique runs.
Draft output: RAG is a really interesting and powerful approach to reducing hallucinations in AI systems. It is worth noting that hallucinations are a significant problem in generative AI. RAG works by giving the model access to external information, which can be very helpful. This means the model can look things up rather than relying solely on what it has memorized during training. RAG can improve the accuracy of AI responses in many different contexts. In conclusion, if you are concerned about hallucinations, RAG is definitely something to consider exploring for your use case.
Rubric scores applied to this draft
| Dimension | Score | Evidence from the draft | Verdict |
|---|---|---|---|
| Slop density | 2 / 5 | "It is worth noting" (banned filler). "very helpful", "really interesting", "definitely something to consider" carry no information. Four of six sentences contain no falsifiable claim. "In conclusion" restatement adds nothing. | Fail |
| Example density | 1 / 5 | Zero concrete examples anywhere. "External information" is undefined. No concrete scenario (e.g. a customer-support bot querying a knowledge base), no tool names, no numbers, no comparisons. The reader cannot picture what RAG actually does. | Fail |
| Argument clarity | 3 / 5 | A central claim exists (RAG reduces hallucination by consulting external sources) but it is spread across three separate sentences, buried under filler, and never stated in one clean sentence. A reader skimming would not extract it reliably. | Borderline |
| Total | 6 / 15 | All three dimensions fall below the threshold of 4, with slop density and example density failing outright. Revision is required on those two before argument clarity can be assessed cleanly. | Revise |
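The arithmetic behind the table can be checked in a few lines. This is a sketch under the template's "any dimension below 4" rule, applied literally, so a borderline 3 is flagged along with the outright failures; `summarize` is a hypothetical helper name.

```python
THRESHOLD = 4          # from the template: revise if any dimension scores below 4
MAX_PER_DIMENSION = 5  # each dimension is scored 1 to 5

# Scores from the table above.
draft_scores = {"slop density": 2, "example density": 1, "argument clarity": 3}


def summarize(scores, threshold=THRESHOLD):
    """Return the total, the maximum possible total, and dimensions below threshold."""
    total = sum(scores.values())
    maximum = MAX_PER_DIMENSION * len(scores)
    below = [name for name, score in scores.items() if score < threshold]
    return total, maximum, below


total, maximum, below = summarize(draft_scores)
print(f"{total} / {maximum}")      # 6 / 15
print("needs revision:", below)
```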
For domain-specific work, you can add a fourth dimension without breaking the pattern. For code outputs, add "edge case coverage" (does the code handle empty input, type mismatches, and the documented boundary conditions?). For research summaries, add "evidence type specificity" (does the output name the type of evidence it cites rather than asserting claims as facts?). Keep the total at four or fewer.
Worked example: a prompt before and after the pattern
The clearest way to see the pattern's effect is to compare outputs from the same task with and without it. The task below is a common one: summarizing the tradeoffs of a technical decision for a non-technical audience.
Output without the pattern:
When deciding between a vector database and a traditional relational database for semantic search, there are several important tradeoffs to consider. Both approaches have their advantages and disadvantages, and the right choice depends on your specific needs and requirements.
Vector databases are designed to handle high-dimensional data and are optimized for similarity search. They can be very powerful for use cases like semantic search. However, they may require more technical expertise to set up and maintain.
Traditional relational databases are well-established and widely understood. They are generally easier to manage for teams that already have SQL expertise. However, they are not natively optimized for semantic similarity queries, which can lead to performance issues at scale.
In conclusion, the best choice depends on your team's expertise, your performance requirements, and the specific nature of your search use case.
Output with the pattern:
For semantic search on a catalog of more than roughly 100,000 items, a vector database outperforms a relational database on query speed by an order of magnitude. The reason is structural: relational databases compare text rows using exact or pattern-match logic (SQL LIKE or full-text indexes), while vector databases compare high-dimensional numerical representations of meaning, so a query for "lightweight rain jacket" also surfaces "packable windbreaker" without any keyword overlap.
The cost of that capability is operational complexity. Pinecone and Weaviate, two common choices, require maintaining a separate data pipeline to generate and index the embeddings whenever your catalog changes. A team already running PostgreSQL can add the pgvector extension and stay within a familiar system, accepting a query-speed tradeoff that only becomes a problem above roughly half a million items or under 200-millisecond latency requirements.
The clearest decision rule: if your PM can define "semantic" success with a list of five query-to-result pairs and those pairs require understanding synonyms or intent rather than keywords, benchmark pgvector first. Upgrade to a dedicated vector store if query latency exceeds your product's requirement on your actual catalog size, not before.
The second output is not just better formatted. It makes specific claims (100,000 items, 200-millisecond latency, Pinecone and Weaviate), gives the product manager a concrete decision rule, and arrives at a position rather than a hedge. The rubric forced the model to notice it had not done those things in its first draft and fix them before returning output.
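Because the template asks for the final output, then the scores, then a note, the scores can be pulled out of the response mechanically and logged. The template does not prescribe a score format, so the sketch below assumes the model writes one line per dimension such as `SLOP DENSITY: 4`; if you rely on parsing, it is probably worth stating that format in the prompt. `parse_scores` and the sample response are hypothetical.

```python
import re

DIMENSIONS = ("SLOP DENSITY", "EXAMPLE DENSITY", "ARGUMENT CLARITY")


def parse_scores(response: str) -> dict[str, int]:
    """Extract 'NAME: n' or 'NAME = n' score lines (assumed format) from a response."""
    scores = {}
    for name in DIMENSIONS:
        pattern = rf"{re.escape(name)}\s*[:=]\s*([1-5])\b"
        match = re.search(pattern, response, flags=re.IGNORECASE)
        if match:
            scores[name] = int(match.group(1))
    return scores


# Hypothetical response tail, for illustration only.
sample_response = """(final output text here)

SLOP DENSITY: 5
EXAMPLE DENSITY: 4
ARGUMENT CLARITY: 4
Note: one claim in the second paragraph still lacks an example."""

scores = parse_scores(sample_response)
missing = [name for name in DIMENSIONS if name not in scores]
print(scores)
print("missing dimensions:", missing)
```

A non-empty `missing` list is a useful signal in its own right: it suggests the model skipped part of the rubric.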
What does an anatomy of a testable prompt look like?
The Self-Critique Rubric Pattern is most effective when the prompt around it is also well-structured. A rubric appended to a vague, unstructured prompt will raise the floor but cannot fully compensate for the prompt's other weaknesses. The four layers of a testable prompt build on each other.
This four-layer structure maps to a broader principle in the RAILS framework (the five-rule structure we cover in our complete prompt engineering guide): Role, Architecture, Instructions, Loop, Safety. The Self-Critique Rubric Pattern is the Loop component: it closes the feedback path so the model corrects its own output rather than waiting for you to notice the error and re-run. For a deeper look at the instruction layer specifically, see our guide on how to write a system prompt.
Why do ban lists outperform positive-only instructions?
Positive instructions set a direction; ban lists close the floor. Telling the model to "be specific" leaves the definition of specificity to the model's defaults, which vary. Telling it not to make a claim without citing the evidence type is binary: either the claim cites an evidence type or it does not. The model can self-check a ban list. It cannot easily self-check an abstract instruction to "be better."
The practical comparison: a prompt with the instruction "write concisely" will routinely produce outputs with unnecessary preamble because the model's training corpus contains enormous amounts of preamble and "write concisely" does not specifically forbid any of it. The same prompt with "do not begin with a sentence that restates the question" removes the most common form of preamble immediately and completely.
A few ban-list items are close to universal for prose prompts. The phrase "it is worth noting" is filler in every context because it provides no information about what follows. "In conclusion" is almost always a cue for a restatement that adds nothing. "Delve" and "tapestry" are statistical AI tells that signal training-corpus regurgitation. None of these are useful. All of them can be banned without any cost to output quality.
For code prompts, the ban list looks different: no SELECT *, no bare except: pass, no magic numbers without a named constant. The principle is the same: ban the patterns that signal shortcuts or low-quality defaults.
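Because a ban list is binary, it can also be checked outside the model with plain string matching. The sketch below scans text for the prose ban-list items named above; `find_banned` is a hypothetical helper, and the substring match is deliberately crude (it would also flag "delved").

```python
BANNED_PHRASES = ("it is worth noting", "in conclusion", "delve", "tapestry")
EM_DASH = "\u2014"  # the em-dash character, also on the prose ban list


def find_banned(text: str) -> list[str]:
    """Return the banned patterns that appear in text (case-insensitive)."""
    lowered = text.lower()
    hits = [phrase for phrase in BANNED_PHRASES if phrase in lowered]
    if EM_DASH in text:
        hits.append("em-dash")
    return hits


# Two sentences from the RAG draft reviewed earlier in this article.
draft = (
    "It is worth noting that hallucinations are a significant problem in "
    "generative AI. In conclusion, if you are concerned about hallucinations, "
    "RAG is definitely something to consider exploring for your use case."
)
print(find_banned(draft))  # ['it is worth noting', 'in conclusion']
```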
How do you set the right revision threshold?
The threshold is the minimum score on any dimension below which the model revises. Setting it to 4 out of 5 on all three dimensions is a reasonable starting point for most use cases. It is high enough to filter genuinely poor outputs and low enough that the model does not spin indefinitely on a task that is inherently ambiguous.
One useful calibration exercise: score ten of your existing outputs manually against your rubric. Find the score below which you have consistently rejected outputs in the past. Set your threshold there. If you have never rejected an output that scored 3/5 on slop density, your actual threshold for that dimension is at most 3, not 4, and an overly high threshold will waste model inference.
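The calibration exercise can be made mechanical once you have the manual scores. The sketch below handles one dimension at a time and assumes you have recorded, for each past output, its score and whether you kept it; `calibrate_threshold` and the sample history are hypothetical.

```python
def calibrate_threshold(history, max_score=5):
    """Highest score s such that every past output scored below s was rejected.

    history: list of (score, kept) pairs for a single rubric dimension.
    """
    threshold = 1
    for candidate in range(1, max_score + 1):
        kept_below = [kept for score, kept in history if score < candidate]
        if any(kept_below):
            break  # something scoring below this candidate was kept
        threshold = candidate
    return threshold


# Ten manually scored outputs for slop density: (score, did you keep it?).
slop_history = [
    (1, False), (2, False), (2, False), (3, True), (3, True),
    (3, True), (4, True), (4, True), (5, True), (5, True),
]
print(calibrate_threshold(slop_history))  # 3
```

In this sample every output scoring 3 was kept, so revising everything below 4 would spend inference on drafts you would have accepted anyway.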
For longer or more complex tasks, you can add a maximum revision count: "revise at most twice." This prevents the model from entering a long revision loop on a task where the rubric criteria are in genuine tension with each other (high example density can reduce argument clarity if the examples overwhelm the structure).
How does this pattern differ from academic self-refinement research?
The Self-Critique Rubric Pattern shares surface structure with the self-refinement paradigm studied in AI research, but the two serve different goals and make different commitments. Madaan et al. (2023), "Self-Refine: Iterative Refinement with Self-Feedback", demonstrated that models can improve outputs through iterative self-generated feedback across tasks including math reasoning, code optimization, and dialogue response generation. Their setup uses separate generate, feedback, and refine steps, often across multiple rounds, with the model supplying natural-language critique as an intermediate artifact.
Three differences matter for practitioners. First, self-refinement as studied by Madaan et al. is a multi-call pipeline: generate, then critique, then revise, with the feedback step producing an intermediate string that is fed back into a new inference call. The Self-Critique Rubric Pattern collapses this into a single call by embedding the evaluation criteria up front. You trade iteration depth for operational simplicity. Second, academic self-refinement research typically leaves the quality criteria implicit, relying on the model's prior knowledge of what "good" means for the task. The rubric pattern makes the criteria explicit and numeric, which is what makes the threshold meaningful and repeatable. Third, self-refinement benchmarks focus on correctness improvements on tasks with verifiable answers (math, code). The three-dimension rubric here targets prose quality, where correctness is not a useful frame. The slop density, example density, and argument clarity dimensions are designed for this space specifically, not derived from the broader literature. The ban-list-as-floor argument, in particular, has no direct analog in the self-refinement literature; it is a practitioner invention that closes a gap the academic framing leaves open.
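The first difference, call count, is easy to see in code. The sketch below shows the generic shape of a generate, critique, revise loop next to the single-call rubric pattern. It is not the Self-Refine authors' implementation: `Model` is a hypothetical "prompt in, text out" callable, the prompt wording is invented for illustration, and `CountingModel` is a fake that only counts calls.

```python
from typing import Callable

Model = Callable[[str], str]  # hypothetical: prompt text in, completion text out


def multi_call_refine(model: Model, task: str, rounds: int = 2) -> str:
    """Generic multi-call shape: one generate call, then critique + revise per round."""
    draft = model(f"TASK:\n{task}")
    for _ in range(rounds):
        feedback = model(f"Critique this draft:\n{draft}")
        draft = model(
            f"TASK:\n{task}\n\nDRAFT:\n{draft}\n\nFEEDBACK:\n{feedback}\n\n"
            "Revise the draft."
        )
    return draft


def single_call_rubric(model: Model, task: str, rubric: str) -> str:
    """Rubric pattern: criteria and the revise clause travel inside one prompt."""
    return model(f"TASK:\n{task}\n\n{rubric}")


class CountingModel:
    """Fake model for illustration: returns a placeholder and counts calls."""

    def __init__(self) -> None:
        self.calls = 0

    def __call__(self, prompt: str) -> str:
        self.calls += 1
        return f"response {self.calls}"


pipeline_model = CountingModel()
multi_call_refine(pipeline_model, "Explain RAG.", rounds=2)

rubric_model = CountingModel()
single_call_rubric(rubric_model, "Explain RAG.", "# SELF-CRITIQUE RUBRIC ...")

print(pipeline_model.calls, rubric_model.calls)  # 5 1
```

Two refinement rounds cost five calls in the pipeline shape against one for the rubric pattern, which is the "iteration depth for operational simplicity" trade described above.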
Does the Self-Critique Rubric Pattern work the same on every model?
The pattern works on every major current model but with different calibration requirements. Frontier models (GPT-4 class, Claude 3 Sonnet and above, Gemini 1.5 Pro) follow the rubric instructions reliably and produce coherent self-critiques. Smaller or quantized models follow the instruction format, but their self-critique accuracy is lower: they will sometimes score a weak output as 5/5 and miss obvious problems. The rubric is most effective when the model has enough instruction-following capability to reliably apply a three-criteria framework.
The OpenAI model documentation and the Anthropic model overview both note that instruction-following performance correlates with model size. For self-critique rubrics specifically, this matters: a model that is borderline on instruction-following will follow the rubric format syntactically but apply the criteria loosely. If you are running on a smaller model and the rubric does not seem to be catching the problems you expect, simplify the rubric to a single dimension first and verify it is being applied before adding more.
On reasoning models (o-series from OpenAI, extended thinking from Anthropic), the internal chain-of-thought that precedes the visible output functions as a natural self-critique pass. You still benefit from appending the rubric, because it names the criteria explicitly rather than leaving them implicit, but the revision step is less frequently triggered because the model's generation quality is already higher.
When should a tested prompt be promoted?
The Self-Critique Rubric Pattern is the most important evaluation technique at the individual-prompt level. But at some point, running the same prompt repeatedly becomes an argument for a more durable structure. The useful signal is repetition: when you have run the same prompt, with minor variations, more than three times, the prompt has revealed what it needs to work reliably. That is the moment to promote it.
Promotion means four things: adding typed variable slots for every input that changed across runs, pinning a version number, writing down what the rubric threshold was and why, and naming the failure modes you observed before promotion. A promoted prompt is testable in a way a one-off prompt never is. You can run it against a consistent set of test inputs and track whether its scores degrade over time.
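The four parts of promotion map naturally onto a small record type. The sketch below is one possible layout, not a prescribed schema: `PromotedPrompt`, its field names, and the example values are all hypothetical.

```python
from dataclasses import dataclass
from string import Template


@dataclass(frozen=True)
class PromotedPrompt:
    name: str
    version: str                   # pinned version number
    template: str                  # prompt text with $slot placeholders
    slots: dict[str, type]         # typed variable slots
    threshold: int                 # rubric threshold in force at promotion
    threshold_rationale: str       # why that threshold was chosen
    known_failure_modes: tuple[str, ...] = ()

    def render(self, **values) -> str:
        """Fill the template, checking every slot is present and correctly typed."""
        missing = set(self.slots) - set(values)
        if missing:
            raise KeyError(f"missing slots: {sorted(missing)}")
        for key, expected in self.slots.items():
            if not isinstance(values[key], expected):
                raise TypeError(f"slot {key!r} expects {expected.__name__}")
        return Template(self.template).substitute(
            {key: values[key] for key in self.slots}
        )


tradeoff_summary = PromotedPrompt(
    name="tradeoff-summary",
    version="1.0.0",
    template="Summarize the tradeoffs for $audience in at most $word_limit words.",
    slots={"audience": str, "word_limit": int},
    threshold=4,
    threshold_rationale="Outputs scoring 3 on example density were rejected in review.",
    known_failure_modes=("hedged conclusion", "no concrete numbers"),
)
print(tradeoff_summary.render(audience="product managers", word_limit=300))
```

Keeping the threshold rationale and failure modes next to the template means the reasons for the bar travel with the prompt when it is reused.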
A prompt you have tested and scored is worth keeping and versioning. We put rated, versioned prompts and brains in BrainBoot's marketplace (our own Prompt OS) for exactly that reason. The marketplace is where promoted prompts that cleared a rubric threshold live alongside their documented test inputs and scores, so you can reuse them without rebuilding from scratch each time.
Frequently asked questions
What is the Self-Critique Rubric Pattern in prompt engineering?
How do you evaluate whether a prompt is working?
What is the difference between a prompt and a brain in prompt engineering?
Why do positive-only instructions fail in prompts?
What scoring dimensions should a prompt rubric include?
Bottom line
If you take one technique from this article, make it the Self-Critique Rubric Pattern. Append a rubric to your next prompt, set a threshold of 4 on each of the three dimensions, and observe what the model reports about its own draft before it revises. You will see, in plain text, exactly what it considered weak and what it changed. That diagnostic information alone is worth the two minutes it takes to add the rubric. From there, the progression is straightforward: identify your actual failure patterns, build a ban list from them, layer the four anatomy components around the rubric, and when you have run the prompt more than three times, promote it to a versioned, tested unit. That is the complete path from "hoping the output is good" to "knowing it is." For the full RAILS framework that this technique sits inside, see our complete prompt engineering guide.
References
- Wei et al. (2022). Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. arXiv:2201.11903.
- Madaan et al. (2023). Self-Refine: Iterative Refinement with Self-Feedback. arXiv:2303.17651.
- OpenAI Evals documentation (2024).
- Anthropic. Evaluating Claude: model evaluation documentation.
- Anthropic. Claude models overview (2026).
- OpenAI. Models and capabilities documentation (2026).