Model-specific prompting in 2026: how Claude, GPT-5, and Gemini respond differently to the same prompt
The short version: a prompt that extracts excellent output from Claude Sonnet can return thin hedging from GPT-5.5 and over-cited clutter from Gemini, when sent word-for-word with no adjustment. Not because any model is defective, but because each model family was shaped by a different training objective, a different safety-calibration philosophy, and a different intended use case. Understanding what each model actually rewards, not what its marketing page says, is the fastest way to stop leaving quality on the table. This article introduces the Per-Model Prompt Profile, a compact reference table for targeted edits, and shows three worked examples of the same base prompt adapted for each model family. It is part of our complete prompt engineering guide.
- Core claim: each model family rewards a different prompt architecture; generic prompts give generic results.
- The asset: the Per-Model Prompt Profile, a five-axis table telling you exactly what to adjust before switching models.
- The technique: name the model inside the prompt itself so any reviewer knows which engine produced the output, and so you build the discipline of targeting rather than broadcasting.
- Recommended defaults (2026): claude-sonnet-4-6 for structured analysis; claude-haiku-4-5 for high-volume generation; gpt-5.5 for tool-call chains; Gemini 2.5 Pro/Flash for grounded search tasks.
Why does the same prompt land differently on different models?
Three forces cause divergence.
Training data mix. Claude's training pipeline runs through a large amount of long-form written text and is heavily shaped by Constitutional AI, Anthropic's published method for instilling value alignment via self-critique. GPT-5 was shaped by enormous diversity of format and task, with heavy emphasis on tool use and code-function calls. Gemini was trained with deep integration of search-grounding capabilities and multi-modal inputs from Google's data infrastructure. The data mix is not just a curiosity; it is the substrate of the reward structure the model learned during fine-tuning.
RLHF objective divergence. Reinforcement Learning from Human Feedback (RLHF) fine-tunes the pretrained model against human rater preferences. Different labs use different rater guidelines, different rater populations, and different task distributions. The apparent result: Claude behaves as though its raters penalized hedging and rewarded direct, confident, well-structured output; GPT-5 as though its raters rewarded helpfulness and code correctness; Gemini as though its raters rewarded factual accuracy and source citation. These are inferences rather than documented internal practices; they are implied by the behavior differences observable across thousands of prompt runs.
Safety calibration at different thresholds. Claude applies explicit harm-avoidance layers that will surface a refusal or a caveat at lower thresholds than GPT-5 on certain content categories, and at higher thresholds on others. Gemini's safety layer is more context-dependent and search-integrated. Knowing these thresholds is not about bypassing safety; it is about not accidentally triggering a refusal on a completely benign task by phrasing a constraint the wrong way for that model.
What is the Per-Model Prompt Profile?
The Per-Model Prompt Profile is a five-axis reference table that maps what each major model family rewards at the prompt level. It is not a cheat sheet of magic phrases. It is a structured description of the reward structure each model learned, expressed as actionable edits to a base prompt.
The five axes are: role framing, constraint style, output format response, self-critique uptake, and refusal handling. Together they cover the levers that produce the largest quality lift when targeted correctly. The table below shows the 2026 state across Claude (Sonnet and Opus 4.x series, per Anthropic's model documentation), GPT-5 (per OpenAI API documentation), and Gemini 2.x (per Google AI Studio documentation).
| Model family | Role framing | Constraint style | Output format | Self-critique uptake | Refusal handling |
|---|---|---|---|---|---|
| Claude 4.x (Haiku / Sonnet / Opus) | Responds strongly to a named competency role. "You are a senior data engineer" outperforms "You are an assistant". Rewards explicit task framing over conversational openers. | Accepts a structured ban-list naturally. "Never use passive voice. Never fabricate numbers. Never say 'it is worth noting'." reads as instruction, not hostility. | Reliable on nested JSON schemas with exact keys. Rarely hallucinates extra keys on Sonnet/Opus. Haiku may add commentary outside the schema; suppress with a closer instruction. | Highest uptake of self-grading instructions. Adding "score your output 1-10 on argument clarity; if below 8, revise and re-score" produces measurable revision cycles on Sonnet and Opus. | Will push back on bad input rather than comply with silence. Lean into this: add "if the question is ambiguous or underspecified, state the assumption you are making before answering". |
| GPT-5.5 / 5.4 (standard / mini) | Responds well to conversational role openers ("You are helping a fintech startup's growth team") and to explicit tool declarations ("You have access to a Python interpreter"). | Prefers inline cautions rather than ban-list blocks. Embedding constraints in the task sentence ("write three bullet points, avoid superlatives") outperforms a separate constraints section. | Strongest on tool-call schemas and function-definition JSON. Code generation quality noticeably higher than Claude on complex multi-file tasks. Responds well to markdown section headers when plain text output is wanted. | Moderate self-critique uptake. Revise loops work, but explicit "re-score" language sometimes produces acknowledgment rather than genuine revision. Pair with a specific dimension ("re-score only on technical accuracy"). | Complies more permissively on edge cases and is less likely to surface an assumption. Explicitly ask for assumption disclosure if you need it: "list any assumption you made about unclear input". |
| Gemini 2.5 (Flash / Pro) | Responds to task-plus-tool declarations ("You are a research analyst. You have web access. Ground every claim in a source"). Multi-modal role framing ("you are reviewing this image alongside the document") activates capabilities other models lack. | Benefits from explicit grounding instructions as a constraint variant: "only cite sources you can verify; mark unverified claims [unverified]". Pure ban-lists are accepted but less naturalised than on Claude. | Strong on citation-rich outputs and structured summaries. Tends to over-cite on long answers without a count constraint ("cite no more than three sources per paragraph"). JSON schema output is reliable on Pro; Flash may drift on deeply nested schemas. | Self-critique uptake is moderate-high on Flash, lower on Pro when grounding is active. A re-score loop paired with a grounding check ("re-evaluate: does each factual claim have a source?") works better than a general quality rubric. | Refusal behavior is context-dependent and more variable than Claude. For borderline-adjacent but benign tasks, add explicit context ("this is for a journalism research project, not publication") rather than relying on the model to infer it. |
Table reflects published model behavior as of June 2026. Model behavior shifts with each release; verify against the primary docs linked above before assuming the table is current for a new model version.
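One way to make the profile usable in tooling is to store it as data. The sketch below is a minimal illustration, not a vendor schema: the names `PromptProfile`, `PROFILES`, and `profile_for` are hypothetical, and each table cell is condensed to a short label.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PromptProfile:
    """One row of the Per-Model Prompt Profile, condensed to short labels."""
    role_framing: str
    constraint_style: str
    output_format: str
    self_critique: str
    refusal_handling: str


# Hypothetical condensed labels, paraphrased from the table above.
PROFILES = {
    "claude": PromptProfile(
        role_framing="named competency role",
        constraint_style="block ban-list",
        output_format="nested JSON with exact keys",
        self_critique="score-and-revise loop",
        refusal_handling="ask for stated assumptions before the answer",
    ),
    "gpt": PromptProfile(
        role_framing="conversational context plus tool declarations",
        constraint_style="inline cautions in the task sentence",
        output_format="function-call schemas or named markdown headers",
        self_critique="re-score on one specific dimension",
        refusal_handling="ask for a list of assumptions made",
    ),
    "gemini": PromptProfile(
        role_framing="task plus tool declarations",
        constraint_style="grounding instructions",
        output_format="cited summaries with a source-count cap",
        self_critique="re-score paired with a grounding check",
        refusal_handling="state benign context explicitly",
    ),
}


def profile_for(model_id: str) -> PromptProfile:
    """Look up a profile by the family prefix of a model ID (text before the first hyphen)."""
    family = model_id.split("-", 1)[0]
    if family not in PROFILES:
        raise KeyError(f"no profile for model family: {family!r}")
    return PROFILES[family]
```

Calling `profile_for("claude-sonnet-4-6")` returns the Claude row. Keeping the table as data means a prompt-building script can read the same profile a human reviewer reads.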
What is each model family optimizing for?
Before adjusting a single prompt element, it helps to understand the architectural bet each lab placed. The three-way anatomy below surfaces the structural reason each model behaves the way it does, not a marketing summary of what the lab claims.
What are the five axes of the Per-Model Prompt Profile?
Each axis targets one structural element of a prompt that behaves differently across model families. Adjusting all five for a model switch takes under two minutes on a well-structured base prompt.
Axis 1: Role framing. This is the opening persona instruction. On Claude, specificity of competency matters more than warmth of framing. "You are a senior database engineer with eight years of PostgreSQL experience at high-traffic SaaS companies" outperforms "You are a helpful assistant who knows databases." On GPT-5, embedding the role in a conversational context ("You are helping a three-person fintech startup") activates a different kind of helpfulness. On Gemini, declaring the tools the model has access to alongside the role is more activating than role alone. Role framing is the "R" in the RAILS framework covered in our complete prompt engineering guide.
Axis 2: Constraint style. Claude handles a structured, block-formatted ban-list well. GPT-5 handles inline constraints embedded in the task description more naturally. Gemini responds best when constraints are phrased as grounding instructions. The mechanism is the same in all three cases (explicit forbidden-pattern declarations), but the syntactic placement and format affect how reliably each model honors the constraint.
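The difference in placement can be sketched as a small rendering function. This is an illustration under stated assumptions: `with_constraints` is a hypothetical helper, the family keys are shorthand, and the exact wording each model responds to best should be tested on your own task.

```python
def with_constraints(family: str, task: str, banned: list[str]) -> str:
    """Attach the same banned-word list to a task in the placement each family is described as preferring."""
    quoted = ", ".join(f'"{word}"' for word in banned)
    if family == "claude":
        # Separate, block-formatted ban-list after the task.
        ban_list = "\n".join(f'- Never use the word "{word}".' for word in banned)
        return f"{task}\n\nCONSTRAINTS (follow exactly):\n{ban_list}"
    if family == "gpt":
        # Inline caution placed right next to the task sentence.
        return f"{task} Avoid the words {quoted}."
    if family == "gemini":
        # Inline caution plus a grounding-style constraint.
        return f"{task} Avoid the words {quoted}. Mark unverified claims [unverified]."
    raise ValueError(f"unknown model family: {family!r}")


print(with_constraints("claude", "Write three bullet points.", ["powerful", "seamless"]))
```

The banned-word list is maintained once; only its rendering changes per model family.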
Axis 3: Output format response. All three models respond to JSON schema declarations, but fidelity differs by model and schema depth. Claude Sonnet and Opus reliably hold exact keys even on deeply nested schemas. GPT-5 is strongest on function-call schemas and code output. Gemini Pro is reliable on flat schemas; Flash drifts on deeply nested ones. On all three models, always include a fallback instruction ("if any field is unavailable, return null rather than omitting the key").
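Schema fidelity is easy to check on the client side whichever model produced the reply. The sketch below assumes the reply is a raw JSON object with no markdown fences; `check_keys` and the example keys are hypothetical names chosen for illustration.

```python
import json


def check_keys(raw: str, expected: set[str]) -> tuple[set[str], set[str]]:
    """Return (missing, extra) top-level keys in a model's JSON reply.

    A key whose value is null counts as present, which matches the
    "return null rather than omitting the key" fallback instruction.
    """
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object at the top level")
    actual = set(data)
    return expected - actual, actual - expected


reply = '{"title": "Q3 summary", "owner": null, "notes": "extra field"}'
missing, extra = check_keys(reply, {"title", "owner", "due_date"})
print(missing, extra)  # {'due_date'} {'notes'}
```

A non-empty `missing` or `extra` set is a signal to tighten the closing instruction or to retry, rather than to patch the output by hand.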
Axis 4: Self-critique uptake. This is the highest-leverage axis and the least-used one. Chain-of-thought prompting (Wei et al., 2022, in the original chain-of-thought paper) established that asking a sufficiently large model to show its reasoning measurably improves accuracy on multi-step reasoning tasks. The self-critique loop extends this: you end the prompt with a scoring rubric and a revision instruction. Claude Sonnet and Opus have the highest uptake of this pattern. A prompt that ends with "score your answer 1-10 on factual precision and argument clarity; if either is below 8, revise and re-score before delivering the final answer" produces a quality lift on Claude that you can observe in the output's length, specificity, and internal consistency.
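The rubric-plus-revision ending described above can be generated rather than retyped. A minimal sketch, assuming hypothetical helper names (`self_critique_suffix`, `with_self_critique`) and a default threshold of 8 taken from the example wording:

```python
def self_critique_suffix(dimensions: list[str], threshold: int = 8) -> str:
    """Build a score-and-revise instruction for the given rubric dimensions."""
    if not dimensions:
        raise ValueError("at least one scoring dimension is required")
    if not 1 <= threshold <= 10:
        raise ValueError("threshold must be between 1 and 10")
    return (
        f"Score your answer 1-10 on each of: {', '.join(dimensions)}. "
        f"If any score is below {threshold}, revise and re-score "
        "before delivering the final answer."
    )


def with_self_critique(prompt: str, dimensions: list[str], threshold: int = 8) -> str:
    """Append the self-critique instruction as the last paragraph of a prompt."""
    return prompt.rstrip() + "\n\n" + self_critique_suffix(dimensions, threshold)


print(with_self_critique("Summarize the attached report.", ["factual precision", "argument clarity"]))
```

For models with moderate uptake, passing a single dimension (for example `["technical accuracy"]`) follows the profile's advice to re-score on one specific axis.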
Axis 5: Refusal handling. This is the "S" (Safety) axis of the RAILS framework. Claude will surface a refusal on ambiguous or underspecified input, which is useful: it tells you your prompt was unclear. GPT-5 will more often produce a plausible-sounding output without flagging the ambiguity. Gemini's refusal behavior is the most variable and context-dependent. For professional workflows, Claude's refusal behavior is the most useful because it surfaces prompt defects that GPT-5 would silently paper over. On Claude, lean into this by adding "if the input is ambiguous, state your assumption explicitly before answering" rather than trying to engineer around it.
Three worked examples: the same task, three model-tuned versions
The base task: generate a competitive positioning summary for a B2B SaaS product. Below are three versions of the same prompt, each tuned to the model it is targeting. The changes are deliberate and minimal; the goal is to show the targeted edits, not rewrite the entire prompt from scratch on every switch.
Before the full worked examples, the table below isolates the four prompting deltas that matter most in practice when you switch between the current flagship of each family. Model IDs are verified against primary documentation as of June 10, 2026. Behavioral descriptions are based on observed prompt performance, not on vendor-published benchmarks; framing reflects mid-2026 defaults and will shift as models are updated.
| Prompting delta | Claude Sonnet 4.6 / Opus 4.8 (claude-sonnet-4-6 / claude-opus-4-8) | GPT-5.5 / GPT-5.4 (gpt-5.5 / gpt-5.4) | Gemini 2.5 Pro / Flash (gemini-2.5-pro / gemini-2.5-flash) |
|---|---|---|---|
| System-prompt weight: how much the system prompt governs output vs. the user turn | High weight. System-prompt instructions are treated as a binding contract. Constraints set in the system prompt hold reliably through long multi-turn conversations. Sonnet and Opus 4.x rarely drift from a system-level ban-list even after 20+ turns. Place your most critical constraints here, not in the user turn. | High weight, tool-aware. System prompt governs well but GPT-5.x was optimized for tool-augmented pipelines, so instructions that reference tool definitions or function schemas activate more reliably than pure natural-language directives. If your system prompt does not mention tools, consider whether the user turn is the better place for task-level constraints. | Moderate weight (grounding can override). System-prompt instructions hold, but Gemini's grounding layer may surface information that contradicts a system-level constraint when web access is active. Add explicit grounding-scope instructions ("only use grounding for factual claims, not for tone or format") if your system prompt sets strict output rules. Illustrative: based on observed behavior; verify with current Gemini API docs. |
| Format adherence: how reliably the model follows an explicit output schema or structure instruction | Highest fidelity on nested JSON. Claude Sonnet 4.6 and Opus 4.8 hold exact-key JSON schemas reliably, including deeply nested objects, across long outputs. Haiku 4.5 is reliable on flat schemas; add a closer instruction ("return only the JSON object, no commentary before or after") to suppress prose wrapping on shorter models. | Strongest on function-call schemas. GPT-5.x resolves OpenAI function-calling and structured output schemas with high fidelity. For plain JSON output without the function-call layer, add a tight instruction ("return a raw JSON object with no markdown fences or preamble"). Markdown-section output is also very reliable when headers are named explicitly in the prompt. | Reliable flat schemas; deep nesting drifts on Flash. Gemini 2.5 Pro holds flat and moderately nested JSON schemas reliably. Gemini 2.5 Flash may add commentary outside the schema on deeply nested outputs; add "return only the JSON object; do not add any text before or after" as a closing instruction. Pro is the safer choice for schema-sensitive production tasks. Illustrative for Flash nesting behavior; verify against current docs. |
| Verbosity defaults: how much output the model generates without an explicit length constraint | Calibrated to task scope. Claude Sonnet 4.6 generally matches output length to task complexity without padding. Opus 4.8 trends longer on reasoning-heavy tasks because its adaptive thinking layer produces visible reasoning before the final answer. If you want a compact answer from Opus, add "respond concisely; do not show intermediate reasoning unless I ask for it." | Defaults to thorough, structured output. GPT-5.5 produces comprehensive, well-organized responses by default. This is an asset on complex tasks but can generate excess structure (headers, sub-headers, bullet nesting) on simple tasks. Add "use plain prose, no headers" or set an explicit word-count target if you want a lighter output. GPT-5.4 is somewhat more terse by default. | Over-cites without a count constraint. Gemini 2.5 Pro and Flash front-load source citations on grounded outputs, which can bury the core argument. Always add a source-count cap ("cite no more than two sources per section") and a length target on grounded tasks. On non-grounded tasks, verbosity is closer to Claude's default. Omitting the count constraint is the single most common Gemini prompting error in production workflows. |
| Handling underspecified input: what the model does when the prompt is ambiguous or missing key context | Surfaces ambiguity, asks or flags. Claude will refuse or caveat on underspecified input rather than fabricate an answer. Lean into this: add "if the input is ambiguous, state the assumption you are making before answering." This turns a potential refusal into a structured assumption-disclosure, which is more useful than a silent fabrication. | Complies, flags less. GPT-5.x tends to produce a plausible answer on underspecified input rather than flagging the gap. This looks like helpfulness but can silently paper over prompt defects. Add "list any assumption you made about unclear input at the end of your response" as a trailing instruction to surface the assumptions that GPT-5 would otherwise make invisibly. | Variable; grounding can mask gaps. Gemini's behavior on underspecified input is the most variable of the three families. When grounding is active, it may use web results to fill in context you did not provide, which is sometimes useful and sometimes an unintended substitution. Add "do not infer input context from web sources; if context is missing, state what is missing" to lock this down. Illustrative; verify with current Gemini API docs. |
Model IDs verified against Anthropic docs (claude-sonnet-4-6, claude-opus-4-8, claude-haiku-4-5), OpenAI API docs (gpt-5.5, gpt-5.4, gpt-5.4-mini), and Google AI Studio docs (gemini-2.5-pro, gemini-2.5-flash) as of June 10, 2026. Behavioral descriptions are based on observed prompt performance, not vendor-published benchmarks. Cells marked "Illustrative" describe patterns observed in practice but not independently verified against released documentation for that specific behavior; treat them as starting-point guidance and test against your task. Model behavior changes with each release; re-verify before assuming this table applies to a new model version.
Version 1, tuned for Claude:

You are a senior B2B positioning strategist with deep experience applying April Dunford's positioning framework (from "Obviously Awesome") to SaaS products in competitive markets.

CONSTRAINTS (follow exactly):
- Never use the words "powerful", "seamless", "intuitive", or "game-changer".
- Never write passive voice.
- Never fabricate market share percentages or analyst citations.
- Never use em-dashes.
- Mark any claim you cannot verify as [unverified].

OUTPUT SCHEMA (JSON, exact keys, no extras):
{
  "competitive_alternatives": [string, string, string],
  "unique_attributes": [string, string],
  "value_themes": [string, string],
  "positioning_statement": string,
  "assumptions_made": [string]
}

SELF-CRITIQUE: After generating the output, score it 1-10 on (a) specificity: does each claim name a concrete differentiator or use a vague label? (b) honesty: are unverifiable claims marked? If either score is below 8, revise and re-score before returning the final JSON.

REFUSAL: If the product description is too vague to produce honest positioning, state what additional information is needed rather than producing generic output.

Version 2, tuned for GPT-5:

You are helping a B2B SaaS startup's go-to-market team produce a competitive positioning summary. The team is preparing for a board presentation and needs precise, honest output.

Apply April Dunford's positioning framework (from "Obviously Awesome"). Do not fabricate analyst citations or market share numbers. Avoid superlatives like "powerful", "seamless", and "intuitive". List any assumption you made about unclear input at the end.

Return a JSON object with exactly these keys:
- competitive_alternatives (array of strings)
- unique_attributes (array of strings)
- value_themes (array of strings)
- positioning_statement (string)
- assumptions_made (array of strings)

If you cannot verify a factual claim, mark it [unverified] inside the string.

Version 3, tuned for Gemini:

You are a research analyst with web access. Your task is to produce a competitive positioning summary for a B2B SaaS product, grounded in verifiable market information.

Apply April Dunford's positioning framework (from "Obviously Awesome").

GROUNDING RULE: For any market or competitive claim, cite a specific source (URL or publication name + date). If no source is available, mark the claim [unverified]. Cite no more than two sources per section. Do not fabricate citations.

Return a JSON object with exactly these keys:
- competitive_alternatives (array of strings, each with a source or [unverified])
- unique_attributes (array of strings)
- value_themes (array of strings)
- positioning_statement (string)
- sources_used (array of {claim: string, source: string})

After generating, verify: does every factual claim in competitive_alternatives have a source entry in sources_used? If not, revise before returning.
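The verification step that closes the Gemini-tuned prompt can also be repeated on the client side, so a missed source is caught even when the model skips its own check. The sketch below is an assumption-laden illustration: `unsourced_claims` is a hypothetical helper, it assumes each `claim` in `sources_used` repeats the claim string verbatim, and the sample data is placeholder text rather than real market information.

```python
import json


def unsourced_claims(raw: str) -> list[str]:
    """Return competitive_alternatives entries with no matching source.

    An entry passes if it is marked [unverified] or if its exact text
    appears as a "claim" in sources_used (an assumption about how the
    model fills in that field).
    """
    data = json.loads(raw)
    sourced = {entry["claim"] for entry in data.get("sources_used", [])}
    return [
        claim
        for claim in data.get("competitive_alternatives", [])
        if "[unverified]" not in claim and claim not in sourced
    ]


# Placeholder reply, for illustration only.
sample = json.dumps({
    "competitive_alternatives": [
        "Vendor A bundles analytics",
        "Spreadsheets remain the default [unverified]",
        "Vendor B targets enterprises",
    ],
    "sources_used": [
        {"claim": "Vendor A bundles analytics", "source": "Example Publication, 2026-01"},
    ],
})
print(unsourced_claims(sample))  # ['Vendor B targets enterprises']
```

A non-empty result is a reason to re-run the prompt or to mark the claim [unverified] before the summary reaches a reader.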
Why should you name the model inside the prompt itself?
Naming the recommended model at the prompt level, as a metadata field or a header comment, does two things that compound over time.
First, it creates an audit trail. When output quality drops after a platform silently upgrades its default model, you have a record of which model the prompt was calibrated for. RAG pipelines (Lewis et al., 2020, in the original RAG paper) and agent frameworks built on reason-and-act loops (Yao et al., 2022, in the ReAct paper) both benefit from this discipline because the routing logic can use the model flag to dispatch to the correct engine rather than defaulting to whatever the platform exposed last week.
Second, it builds the discipline of treating prompts as versioned artifacts with a specific target, not as generic text that works anywhere. Once you write "recommended_model: claude-sonnet-4-6" or "# model: gpt-5.5" at the top of a prompt, you have acknowledged that the prompt is a targeted document. That acknowledgment changes how you maintain and improve it.
In 2026, the recommended_model discipline maps roughly as follows: claude-sonnet-4-6 for structured analytical and writing tasks where schema fidelity and self-critique uptake matter; claude-haiku-4-5 for high-volume generation at lower cost where the task is well-specified enough not to need the self-critique loop; gpt-5.5 for multi-step tool-call pipelines and complex code generation (gpt-5.4 for the same workloads at lower cost); gemini-2.5-pro for search-grounded research tasks and multi-modal document analysis; gemini-2.5-flash for high-volume grounded generation where Pro's latency is a constraint.
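Routing logic can read the model flag straight from the prompt file. This is a minimal sketch that assumes the flag sits on its own line in one of the two forms quoted above ("recommended_model: ..." or "# model: ..."); the function name `recommended_model` and the fallback parameter are hypothetical.

```python
import re
from typing import Optional

# Matches a whole line such as "recommended_model: claude-sonnet-4-6" or "# model: gpt-5.5".
HEADER = re.compile(r"^(?:#\s*model|recommended_model)\s*:\s*(\S+)\s*$", re.MULTILINE)


def recommended_model(prompt_text: str, default: Optional[str] = None) -> Optional[str]:
    """Return the model ID declared in a prompt's header line, or `default` if none is found."""
    match = HEADER.search(prompt_text)
    return match.group(1) if match else default


prompt = "recommended_model: claude-sonnet-4-6\nYou are a senior data engineer."
print(recommended_model(prompt))  # claude-sonnet-4-6
```

Returning a caller-supplied default keeps unflagged prompts working while making it easy to log how many prompts still lack a declared target.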
What is the 2026 timing window for model-specific prompting?
Model-specific prompting matters more in mid-2026 than it did in 2023 for a structural reason: the models have diverged further, not converged. In 2023, GPT-4, Claude 2, and Gemini Ultra were reasonably close in behavior. By mid-2026, the Claude 4.x series, GPT-5.x, and Gemini 2.5 each have measurable differences in output quality on targeted task types, not just speed or price. This is not a permanent state. As the labs converge on capability and as users increasingly abstract prompts behind orchestration frameworks, the behavioral gap between models may shrink. The window for getting outsized lift from model-specific tuning is now, not in three years when prompting conventions have standardized.
The other timing factor is cost. In 2026, the cost gap between a flagship model and a competent tier-two model is large enough to matter on production workloads. A prompt that works well on claude-haiku-4-5 does not need to run on Opus. Similarly, gpt-5.4 costs less than gpt-5.5 on the same pipeline. Model-specific prompting lets you maintain quality while routing to the cheapest model that passes the task's quality bar. That is not a marginal optimization; on a workflow running 50,000 calls a month it is the difference between a sustainable and an unsustainable infrastructure cost.
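The "cheapest model that passes the quality bar" rule can be written down directly. In the sketch below everything numeric is a placeholder: the model names, per-thousand-call costs, and quality scores are invented for illustration and are not real prices or benchmark results. Substitute figures from your own billing data and evaluation set.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Candidate:
    model_id: str
    cost_per_1k_calls: float  # placeholder currency units, not a real price
    quality_score: float      # 0-10, measured on your own evaluation set


def cheapest_passing(candidates: list[Candidate], quality_bar: float) -> Optional[Candidate]:
    """Return the lowest-cost candidate whose score meets the bar, or None if none does."""
    passing = [c for c in candidates if c.quality_score >= quality_bar]
    return min(passing, key=lambda c: c.cost_per_1k_calls) if passing else None


def monthly_cost(candidate: Candidate, calls_per_month: int) -> float:
    """Project monthly spend for a given call volume."""
    return candidate.cost_per_1k_calls * calls_per_month / 1000


# Hypothetical candidates for one task.
candidates = [
    Candidate("flagship-model", cost_per_1k_calls=20.0, quality_score=9.1),
    Candidate("mid-tier-model", cost_per_1k_calls=6.0, quality_score=8.4),
    Candidate("small-model", cost_per_1k_calls=2.0, quality_score=7.2),
]

choice = cheapest_passing(candidates, quality_bar=8.0)
print(choice.model_id, monthly_cost(choice, 50_000))  # mid-tier-model 300.0
```

The quality score has to come from the tuned prompt for each model, which is the point of the argument above: a prompt tuned for the cheaper model is what lets it clear the bar at all.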
For a deeper analysis of each model family's strengths beyond the prompting layer, including benchmark context and use-case fit, our ChatGPT vs Claude vs Gemini comparison covers the model-choice decision from a buyer's perspective rather than a prompting one.
Who does model-specific prompting not apply to?
Honest anti-recommendation. Not every workflow benefits from this level of tuning.
- One-off queries and casual use. If you are asking a single question, the overhead of maintaining a Per-Model Prompt Profile is not justified. Generic well-structured prompts are sufficient.
- Workflows locked to a single model. If your stack is 100% Claude because of a platform integration or a security policy, model-specific tuning collapses to Claude-specific tuning, and the comparative axis of the Per-Model Prompt Profile becomes irrelevant. Focus on the self-critique loop and constraint style for Claude alone.
- Teams without prompt version control. If your prompts are undocumented strings in application code with no version history, adding model flags will not help because there is no infrastructure to use them. Fix the version control problem first.
- Tasks where model quality is not the bottleneck. If your output quality problem is actually a data-quality or input-specification problem, switching models and tuning prompts will not fix it. Diagnose before optimizing.
Frequently asked questions
Does the same prompt work equally well on Claude, GPT-5, and Gemini?
What is a Per-Model Prompt Profile?
Which model is best for prompt engineering in 2026?
Should I use different system prompts for each model?
What is the recommended model for production prompt engineering workflows?
Bottom line
The Per-Model Prompt Profile is a five-minute investment before switching model families: adjust role framing, constraint style, output format, self-critique instruction, and refusal handling. On well-specified tasks, the quality gap between a generic prompt and a targeted one is larger than the quality gap between the models themselves. Claude Sonnet 4.6 rewards the most structured, constraint-heavy prompts with the highest self-critique uptake. GPT-5.5 rewards conversational context and tool declarations; GPT-5.4 is the cost tier for the same pipeline. Gemini 2.5 Pro rewards grounding instruction and source-count constraints. Name the model inside the prompt. Treat prompts as versioned documents with a target, not as text that works everywhere. That discipline is what separates a one-off query from a reusable cognitive unit, which is exactly where prompt engineering goes from a skill into a system. For the full RAILS framework and every technique covered in this series, start with our complete prompt engineering guide.
- Wei et al. (2022). Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. arXiv.
- Yao et al. (2022). ReAct: Synergizing Reasoning and Acting in Language Models. arXiv.
- Lewis et al. (2020). Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. arXiv.
- Anthropic (2022). Constitutional AI: Harmlessness from AI Feedback.
- Anthropic. Claude models overview. Anthropic documentation. Verified June 2026.
- OpenAI. API platform documentation. Verified June 2026.
- Google. Gemini API documentation. Google AI Studio. Verified June 2026.