How do you prevent an LLM from hallucinating in a RAG pipeline?

Three prompt-level controls reduce hallucination in RAG pipelines. First, include an evidence-type instruction: tell the model to cite only from the provided context blocks and to mark any claim it cannot support with [unverified] rather than inferring. Second, include a refusal clause: instruct the model to respond with a structured absence notice when the retrieved context does not contain enough information to answer the question, rather than filling the gap from training memory. Third, include a self-critique step at the end of the output structure: ask the model to list any claim in its response that is not directly supported by a cited passage, then revise.

What is the RAILS framework for prompt engineering?

RAILS is Nesyona's re-expression of five structural properties that separate a strong, reusable prompt from a one-off instruction. R stands for Role: a specific, competence-anchored persona rather than a generic one. A stands for Architecture: the structured anatomy of a reusable prompt, including system instructions, execution rules, output format, domain protocols, and invariants. I stands for Instructions with forbidden patterns: explicit ban-lists on slop phrases, prohibited SQL patterns, or off-brand language. L stands for Loop: a self-critique and re-scoring clause at the end of the prompt. S stands for Safety: a refusal clause that tells the model to push back on bad input rather than comply silently. This article teaches the A layer: context architecture and how RAG fits into it.

Prompt Engineering Updated June 2026 · 12 min read · Part of the RAILS prompt engineering series

RAILS Series · Architecture Layer (A)

Context and RAG prompting: how to feed an LLM the right information

Q: What is a Context Budget Map?

A Context Budget Map is a three-tier classification that decides what information belongs in the static system prompt, what should be retrieved at query time, and what should be omitted entirely. Tier 1 (in-prompt) holds stable, always-relevant instructions: role definition, output format, forbidden patterns, self-critique rubric. Tier 2 (retrieved) holds dynamic, query-specific content: product records, knowledge-base passages, policy documents, recent news. Tier 3 (omitted) holds noise that wastes tokens and degrades response quality: full database dumps, redundant history, speculative context the model cannot verify.

The single most common reason a well-written prompt produces a mediocre response is not the instructions. It is the context. Either the model is working from nothing when it needs specifics, or it is drowning in a wall of text when it needed two paragraphs. Retrieval-augmented generation (RAG), introduced by Lewis et al. 2020, gave the field a vocabulary for this: rather than relying on a model's parametric memory, you retrieve the right documents and inject them into the prompt at query time. But RAG is an architecture, not just a technique, and understanding the architecture is what separates a prompt that works once from a prompt system that works reliably at scale. This piece is part of our complete prompt engineering guide and covers the A layer of the RAILS framework: prompt anatomy and the context decision.

Last reviewed: June 2026 Next review: December 2026

Bottom line up front

What RAG actually does: it splits your knowledge into a retriever (finds relevant passages) and a generator (writes from them), so the model reads live, specific information instead of guessing from training memory.
The core decision: every piece of information you could include in a prompt falls into one of three tiers: in-prompt (stable instructions), retrieved (dynamic facts), or omitted (noise). The Context Budget Map makes that decision explicit.
Why it matters for prompt quality: a context window is a finite budget. Filling it with the wrong content crowds out the right content; structured retrieval and honest anti-fabrication clauses are the two highest-leverage fixes.

Table of contents

What retrieval-augmented generation is
Anatomy of a reusable prompt
The Context Budget Map
A worked RAG system prompt
Structured output contracts for RAG
Anti-fabrication: the evidence-type clause
The self-critique loop
Which model to use for RAG pipelines
Common context mistakes
FAQ
Bottom line

What is retrieval-augmented generation, in plain terms?

A language model knows what it was trained on, and nothing more. Ask it about a product you launched last month and it will invent an answer that sounds confident and is completely wrong. Ask it to summarize a document you just wrote and it cannot see the document unless you paste it in. Retrieval-augmented generation is the architectural pattern that solves this by splitting the work between two components.

The retriever is a lookup system. When a query arrives, the retriever searches a vector store, a database, or a knowledge base and returns the most relevant passages, typically scored by semantic similarity. The generator is the language model. It receives those retrieved passages as part of its input, alongside the user's query and the standing instructions, and it writes its response by reading and synthesizing the retrieved text. In the original formulation by Lewis et al. 2020 (NeurIPS), RAG was presented as a way to give "open-domain question answering" access to a non-parametric memory bank. The same structural insight applies at every scale, from a tiny product-FAQ chatbot to an enterprise knowledge assistant.

For a prompt engineer, RAG is not primarily a machine learning problem. It is a context architecture problem. The questions that matter are: what information should arrive in the static system prompt, what should be retrieved fresh for each query, and what should be kept out of the context window entirely? Those three questions are what the Context Budget Map answers.

What is the anatomy of a reusable prompt?

A reusable prompt is not a one-off instruction. It is a structured document with five distinct layers, each serving a different function. Conflating the layers is the most common reason prompts degrade over time: someone adds a new instruction to the wrong layer, and the whole thing starts contradicting itself.

The five-layer anatomy is the core of the A (Architecture) component in the RAILS framework. Think of it as the engineering drawing for a cognitive unit you intend to use hundreds of times.

Layer	What it holds	Changes how often	Where it lives
System Prompt	Role definition, output language, tone, persona. The stable identity the model adopts for the entire session.	Version-pinned; changes only on deliberate revision	System message or top of context
Execution Rules	Priority-ordered instructions: what to do, what to do first, what to do if data is missing. The task logic.	Version-pinned with the system prompt	Numbered or labeled list in system message
Output Format	Exact schema: required keys, section headers, field order, field types, fallback for missing data. Not "be structured" but literally the structure.	Version-pinned; breaking changes = version bump	Separate section in system message, or appended to instructions
Domain Protocols	Subject-specific rules: citation format, legal disclaimers, SQL safety checks, brand vocabulary, forbidden competitor names.	Updated when domain rules change	Appended to execution rules or a named section
Invariants	Things that can never be overridden by any user input or retrieved content: the self-critique clause, the refusal clause, the evidence-type constraint, the anti-hallucination instructions.	Never, by design. If something must be overridable, it is not an invariant.	Final section of system prompt, after all other layers

The context window that a RAG pipeline fills falls between the Output Format and Domain Protocols layers, or is injected as a named user-turn block immediately before the actual query. It is not the system prompt. Treating retrieved content as if it were a standing instruction is a category error that causes the model to treat dynamic, potentially wrong information as authoritative standing rules.

What is the Context Budget Map?

The Context Budget Map is a three-tier decision framework that assigns every candidate piece of information to one of three bins: in-prompt, retrieved at query time, or omitted entirely. The premise is that a context window is a finite budget, and the question of what to include is an allocation problem, not an assembly problem. Dropping everything in and hoping the model sorts it out is not a strategy; it is a symptom of not having thought through the architecture.

We find this framing more useful than the common instruction to "just keep your prompts concise," because brevity for its own sake strips context that the model genuinely needs, while this framework forces you to justify each piece of content by its function.

Context Budget Map: three-tier decision diagram for context window allocation. Tier 1 = static system message; Tier 2 = query-time retrieval; Tier 3 = omit entirely.

The most common Tier 3 violation we see is inherited conversation history. A chat interface that appends every prior turn grows the context indefinitely, and by turn twenty, roughly half the context window is spent on resolved sub-questions and discarded drafts. A disciplined RAG system summarizes or truncates old turns, keeping only the information that the current query actually requires.

What does a RAG system prompt actually look like?

The following is a worked, runnable system prompt for a product-support assistant. It implements all five anatomy layers, uses the Context Budget Map explicitly, includes an evidence-type instruction and a self-critique clause, and names a recommended model. We re-derived this from first principles; no BrainBoot prompt text was reproduced verbatim.

Worked Prompt: product-support-rag-v1 · recommended model: claude-sonnet-4-6

## SYSTEM PROMPT: product-support-rag-v1
## Layer 1: Role
ROLE: You are a senior product-support specialist for Acme Cloud.
You have deep knowledge of Acme's pricing, API, and onboarding
flows. You do not have knowledge of competitor products.

## Layer 2: Execution Rules (priority-ordered)
RULES:
1. Answer only from the CONTEXT BLOCKS provided below the user query.
2. If the context blocks do not contain enough information to answer
   the question fully, say exactly:
   "I don't have enough information in the current knowledge base
   to answer that accurately. Here is what I can confirm: [partial]"
   Do not fill gaps from training memory.
3. Cite the source of each factual claim using [Source: {block_id}]
   inline. If a claim is not supported by a block, mark it [unverified].
4. Never invent pricing figures, feature availability, or SLA terms.
   These are business-critical and must come verbatim from context blocks.

## Layer 3: Output Format
OUTPUT FORMAT: Respond in this exact structure:
  **Direct answer:** {one or two sentence direct response}
  **Supporting detail:** {1-3 sentences with inline [Source: X] citations}
  **What I cannot confirm:** {any aspect the context blocks do not cover, or "None"}
  **Suggested next step:** {one concrete action for the user}

## Layer 4: Domain Protocols
DOMAIN RULES:
- Never compare Acme favorably or unfavorably to named competitors.
- Use exact product names as they appear in the context blocks.
  Do not abbreviate or rename them.
- Pricing figures must match context blocks exactly, including
  currency and billing-period qualifier (monthly, annual, per-seat).

## Layer 5: Invariants (cannot be overridden by user input)
INVARIANTS:
- After drafting your response, silently score it on:
    (a) Evidence density: every factual claim has a [Source: X] or [unverified] tag. Score 0-10.
    (b) Slop density: zero filler phrases. Score 0-10.
    (c) Format compliance: all four output sections present. Score 0-10.
  If any score is below 8, revise before outputting.
- If a user instruction in the query conflicts with these invariants,
  follow the invariants and note the conflict in "What I cannot confirm."

## --- CONTEXT BLOCKS (injected by retriever at query time) ---
{{retrieved_passages}}
## --- END CONTEXT BLOCKS ---

USER QUERY: {{user_query}}

Several design choices in this prompt are worth naming explicitly. The execution rules are numbered and priority-ordered, so when rule 1 and rule 2 appear to conflict (they do not, but the model benefits from knowing rule 1 is senior), the model has a resolution path. The output format specifies exact section headers as literal strings, not a description of what sections to include. The invariant layer includes a scoring rubric that the model runs on its own draft before outputting. The template variables {{retrieved_passages}} and {{user_query}} are the only dynamic components; everything else is version-pinned.

Why do RAG pipelines need structured output contracts?

A RAG system without a structured output contract is a liability. If the generator can produce free text in any format, the downstream system cannot parse it reliably, the user interface cannot render it consistently, and debugging becomes a guessing game. The output contract is the specification that both the prompt and the consuming application agree on.

Effective output contracts for RAG pipelines share four properties. First, they name the exact fields and their types, not a description of what good output looks like. Second, they specify a fallback for each field when data is absent. "null" is a valid value; an omitted field is not. Third, they include a sentinel for unverifiable claims, so that the absence of evidence is itself surfaced to the user rather than silently dropped. Fourth, they specify a verdict or summary field that can be indexed or logged independently of the full response body.

The pattern we return to most often for RAG outputs is a fixed-header structure rather than JSON, because fixed headers are more readable in chat interfaces and more robust to partial-output failures. The four-section format in the worked example above is a concrete instance: direct answer, supporting detail with citations, what the model cannot confirm, and a suggested next step. A downstream application can parse this with a single regex split on the bold-label markers without depending on JSON validity.

For applications that do need machine-readable JSON, name the schema in the system prompt with explicit key names and an example of the fallback. "verdict: ship|block|needs-changes" is more useful than "verdict: a string describing the outcome." Exact enumeration values let the consuming code skip an LLM-output parsing layer entirely. The OpenAI Structured Outputs spec and Anthropic's consistency documentation both formalize this at the API level for models that support schema-constrained generation.

How do you stop an LLM from inventing facts it was not given?

The evidence-type clause is the most direct prompt-level control against fabrication in RAG pipelines. The clause does two things: it tells the model what type of evidence to bring to each claim, and it gives the model an explicit, non-embarrassing way to admit when it does not have enough evidence. Without that second part, the model's default behavior is to fill gaps from training memory rather than flag them, because filling gaps is what training rewarded.

The specific wording matters. "Do not make up information" is weaker than "cite every factual claim with [Source: block_id] from the provided context blocks; mark any claim you cannot cite as [unverified]." The first is a prohibition. The second is an instruction that also provides a concrete action for the unavoidable cases where the retrieved content is insufficient. A model that knows it can produce "[unverified]" is less likely to hallucinate than a model that has been told only what not to do.

This approach is consistent with the epistemic discipline that good research practice demands. It is what distinguishes a tool that is useful in high-stakes domains from one that generates fluent but unreliable text. Our colleagues at EduBracket have catalogued the prompt engineering courses that spend the most time on this layer, if you want a structured curriculum for the topic.

A note on benchmark claims: you will find articles about RAG that cite specific accuracy improvement percentages, such as "RAG reduces hallucination by 38 percent." We do not reproduce those figures here because they are task-specific and rarely comparable across implementations. The original Lewis et al. 2020 paper reports results on specific open-domain QA benchmarks; generalizing those numbers to your production pipeline is not supported by the research. What we can say with confidence is that grounded generation from retrieved text produces fewer fabricated facts than ungrounded generation from training memory, on the task classes where RAG is designed to apply. Run your own evaluation on your own data.

What is the self-critique loop, and why is it the highest-leverage prompt addition?

The self-critique loop is an instruction at the end of a prompt that asks the model to score its own draft output against an explicit rubric before producing the final response. It is the L in the RAILS framework. We call it the highest-leverage single addition because it costs nothing extra in terms of prompt design complexity but recovers a significant fraction of the quality losses that come from a model producing its first-pass answer without revision.

The mechanism is not mysterious. Language models produce better output when they have a chance to review and revise, just as human writers do. The self-critique loop exploits the same generation capacity the model uses for everything else, directed at its own output. The key implementation detail is that the rubric must be explicit and numeric: "score from 0 to 10 on evidence density" is actionable; "check that your answer is good" is not.

Three rubric dimensions cover most RAG pipelines: evidence density (every factual claim is cited or marked [unverified]), format compliance (all required output sections are present with correct field names), and slop density (no hedge phrases, no filler, no unsolicited caveats that the user did not ask for). A score below 8 on any dimension triggers a revision pass before the model outputs the final response. In the worked system prompt above, this lives in the Invariants layer, which means user instructions cannot turn it off.

The research lineage for this kind of multi-step generation includes chain-of-thought prompting from Wei et al. 2022, which demonstrated that models produce more accurate answers when they reason through intermediate steps, and the ReAct framework from Yao et al. 2022, which interleaves reasoning and action steps. The self-critique loop is a lighter-weight variant applied within a single generation: reason about the output, then revise it. Our chain-of-thought prompting guide covers the broader research lineage and the practical application patterns.

Does the RAG prompt architecture change for different models?

Yes, and the differences are large enough to matter in production. Three model families dominate serious RAG deployments as of mid-2026: the Claude 4.x family from Anthropic, GPT-5 from OpenAI, and Gemini 2.x from Google. Each handles long-context retrieval differently, and the prompt architecture that extracts peak performance from one is not always optimal for another.

For task classes that require careful instruction-following on complex multi-step schemas, the Claude Sonnet 4 series handles long context more consistently at mid-range costs, which is why it is the recommended model for the worked system prompt above. For tasks requiring very high-volume throughput where the query is simpler and the context block is short, a lighter model like Claude Haiku 4.5 reduces cost without meaningfully degrading quality on well-structured prompts. For tasks where the generating model needs to interact with external tools during generation, the GPT-5 tool-use API or Anthropic's tool-use spec are both strong choices.

The practical rule: run the same system prompt and the same set of test queries against at least two models before committing to a production architecture. Benchmark numbers from papers generalize poorly to specific task domains; your own evals on your own data are the only reliable signal.

What are the most common context mistakes in practice?

After auditing a range of production RAG prompts, we see the same failure modes reappear. None of them are subtle.

Injecting retrieved content into the system prompt rather than the user turn. The system prompt is for standing instructions. Retrieved content changes per query. Putting dynamic, potentially incorrect information in the system prompt gives it the epistemic weight of a standing instruction, which is exactly the wrong signal.

Passing more retrieved passages than the context window can handle well. Longer is not always better. Research on lost-in-the-middle effects in long-context models shows that information at the very beginning and very end of the context is recalled more reliably than information buried in the middle. If you are injecting fifteen retrieved passages and only three are actually relevant, you may be actively hiding the relevant content. Top-3 to top-5 passages with a high similarity threshold generally outperform top-15 with a permissive threshold.

No delimiter between the retrieved context and the user query. If the model cannot clearly identify where the retrieved text ends and the user's question begins, it may treat parts of the retrieved text as part of the question, or vice versa. Use a named delimiter (a section header, an XML-style tag, or a repeated separator line) and be consistent about it across all prompts in the system.

No version control on the system prompt. A RAG pipeline that works today has a system prompt version somewhere. When it stops working next month, you need to know what changed and when. Pin a version identifier in the system prompt itself (the product-support-rag-v1 in the worked example above is deliberate), and track changes in a changelog alongside the prompt file.

When you run the same system prompt three or more times a week, that is the signal to promote it into a version-pinned, parameterized reusable unit. The worked prompt above uses template variables for exactly this reason; the instructions are stable and the inputs are swapped out per query. That progression, from one-off instruction to parameterized, version-controlled cognitive unit, is the core of what prompt template discipline formalized looks like in practice.

Get the RAILS template pack: the full five-layer system prompt skeleton, the Context Budget Map worksheet, and the self-critique rubric as a fillable PDF.

How this guide was built

Primary sources: Lewis et al. 2020 (RAG, NeurIPS), Wei et al. 2022 (chain-of-thought, NeurIPS), Yao et al. 2022 (ReAct, ICLR). OpenAI Structured Outputs documentation and Anthropic prompt engineering documentation, captured June 2026.
Forkable expertise: The Context Budget Map, five-layer anatomy, and worked system-prompt structure are re-derived from first-principles prompt engineering practice developed in building BrainBoot, the prompt OS we built at DeepSynthesis. No prompt text was reproduced verbatim; all worked examples were written fresh for this article.
Benchmark policy: We do not reproduce specific numeric accuracy claims from RAG papers and generalize them to production pipelines. Task-specific benchmark numbers are cited with their original source and described as such. Run your own evals.
Last verified: June 2026. Model API behavior and documentation change frequently; check the linked Anthropic and OpenAI docs for the current state.

Frequently asked questions

What is retrieval-augmented generation (RAG) in prompt engineering?

Retrieval-augmented generation (RAG), introduced by Lewis et al. 2020, is a pattern where a system retrieves relevant documents from an external store and injects them into the prompt at query time. The model reads the retrieved text as part of its input and grounds its response in that evidence. In prompt-engineering terms, RAG is an architecture decision about what information arrives in the context window versus what lives in storage and is fetched on demand.

What is a Context Budget Map?

A Context Budget Map classifies every candidate piece of information into one of three tiers: Tier 1 (in-prompt) for stable standing instructions, Tier 2 (retrieved) for dynamic query-specific content fetched at runtime, and Tier 3 (omitted) for noise that wastes tokens and degrades quality. The framing forces you to justify each piece of content by its function rather than defaulting to including everything.

What is the difference between a system prompt and a RAG context block?

A system prompt carries stable instructions that apply to every query: the role, the output format, the forbidden patterns, and the self-critique clause. A RAG context block carries dynamic, query-specific content injected at runtime from the retrieval step. Both live in the context window, but the system prompt is version-pinned and authored once, while the context block is assembled fresh per query.

How do you prevent hallucination in a RAG pipeline at the prompt level?

Three prompt-level controls reduce hallucination: an evidence-type instruction (cite from the provided context blocks and mark uncitable claims [unverified]), a refusal clause (produce a structured absence notice when context is insufficient rather than filling from training memory), and a self-critique step (score evidence density and revise before outputting). None of these eliminates hallucination entirely, but they reduce the probability and surface it explicitly when it does occur.

What is the RAILS framework?

RAILS is Nesyona's named framework for the five structural properties of a strong, reusable prompt: Role (specific, competence-anchored persona), Architecture (five-layer prompt anatomy), Instructions with forbidden patterns (explicit ban-lists), Loop (self-critique and re-scoring clause), and Safety (refusal clause). This article covers the A layer. The full guide is at our complete prompt engineering guide.

Bottom line

RAG is not a black box you point at a database. It is a context architecture decision: what information belongs in the stable system prompt, what should be retrieved dynamically, and what should stay out of the context window entirely. The Context Budget Map makes that allocation explicit. The five-layer prompt anatomy gives each type of content a home. The self-critique loop catches the quality gaps before the response reaches the user. And the evidence-type clause turns "do not make things up" from a prohibition into an actionable instruction that gives the model a path for handling uncertainty honestly.

If you are building a RAG pipeline from scratch, start with the worked system prompt above, substitute your domain protocols in Layer 4, and run your own evaluation set before promoting it to production. Version-pin the system prompt and track changes. If the same prompt runs three or more times per week, promote it to a parameterized unit with named template variables and a version number. That discipline is the difference between a prompt that works once and a prompt system that works reliably. For the role and persona layer of RAILS, see our role and persona prompting guide. For how to evaluate whether a prompt is actually working, see how to evaluate prompts. And for the complete RAILS series, start at our complete prompt engineering guide.