The short answer
You cannot eliminate prompt injection with a model setting or a filter — there is no reliable model-level fix as of 2026. What you can do is make a successful injection harmless. The whole strategy is to assume the model will be tricked, and design so that being tricked doesn’t let anything bad happen. That means layering controls so no single failure leads to a breach.
This playbook is the practical companion to our explainer on what prompt injection is and the incidents it caused. Work through the layers below in order — they are roughly ranked by impact.
Layer 1: Least privilege (the highest-impact control)
The best-known prompt injection breaches, including EchoLeak’s data exfiltration and the Copilot and Cursor code-execution CVEs, turned damaging only because the assistant could take a powerful action. Remove the capability and you remove the harm.
- Give each assistant the minimum permissions for its job. A support bot that answers questions does not need write access to your CRM. A summarizer does not need to send email.
- Scope every credential narrowly. API keys and tokens should grant the smallest role that works, never blanket access. A token with full account access is a token that can do full account damage.
- Prefer read-only by default. Grant write, delete, or send permissions only where genuinely required, and treat each as a risk decision.
If your assistant cannot perform a destructive action without a human, a prompt injection that tells it to do so becomes a failed attempt in your logs, not an incident.
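One way to picture least privilege is a deny-by-default allowlist checked before every tool call. The sketch below is illustrative only: the tool names, the `GRANTS` table, and the `call_tool` wrapper are all hypothetical and do not come from any particular framework.

```python
from dataclasses import dataclass
from enum import Enum


class Access(Enum):
    READ = "read"
    WRITE = "write"


@dataclass(frozen=True)
class Tool:
    name: str
    access: Access


# Hypothetical tool catalogue; the names are illustrative only.
SEARCH_KB = Tool("search_knowledge_base", Access.READ)
UPDATE_CRM = Tool("update_crm_record", Access.WRITE)
SEND_EMAIL = Tool("send_email", Access.WRITE)

# Each assistant gets an explicit allowlist. Anything not listed is denied,
# so the default for a new assistant is "no capabilities at all".
GRANTS = {
    "support_bot": frozenset({SEARCH_KB}),  # read-only: answers questions
    "summarizer": frozenset(),              # needs no tools
}


class PermissionDenied(Exception):
    """Raised when an assistant requests a tool it was never granted."""


def authorize(assistant: str, tool: Tool) -> None:
    """Raise unless the tool is explicitly granted to this assistant."""
    if tool not in GRANTS.get(assistant, frozenset()):
        raise PermissionDenied(f"{assistant} may not call {tool.name}")


def call_tool(assistant: str, tool: Tool) -> str:
    """Check the grant first, then dispatch (dispatch is stubbed here)."""
    authorize(assistant, tool)
    return f"{tool.name} executed"
```

With this shape, an injected instruction that asks the support bot to call `UPDATE_CRM` raises `PermissionDenied` no matter what the model was persuaded to attempt, because the check lives outside the model.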
Layer 2: Trust boundaries — treat all retrieved content as untrusted
Indirect prompt injection rides inside content the model reads: emails, web pages, documents, uploads, tool results. The fix mirrors decades of web security practice: never trust input from outside your boundary.
- Classify your data sources. Mark which content is trusted (your own vetted system instructions) and which is untrusted (anything from users, the web, third parties, suppliers).
- Never let reading authorize acting. A model that is summarizing a document must not be able to trigger a payment or a deletion just because the document’s text tells it to.
- Be especially careful with agents that browse or ingest. The moment an assistant reads arbitrary external content and holds powerful tools, you have recreated the exact conditions of the headline CVEs. Separate those two functions.
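The trust-boundary idea can be sketched by carrying a provenance label with every piece of content and checking that label before any action runs. Everything here is an assumption made for illustration: the `Trust` labels, the `"system_prompt"` source name, and the `may_authorize` check are hypothetical, and a real system would need a richer policy.

```python
from dataclasses import dataclass
from enum import Enum


class Trust(Enum):
    TRUSTED = "trusted"
    UNTRUSTED = "untrusted"


def classify(source: str) -> Trust:
    """Label a source. Only the operator's own vetted system instructions
    count as trusted; users, web pages, email, and uploads do not."""
    return Trust.TRUSTED if source == "system_prompt" else Trust.UNTRUSTED


@dataclass(frozen=True)
class Content:
    text: str
    source: str

    @property
    def trust(self) -> Trust:
        return classify(self.source)


@dataclass(frozen=True)
class ActionRequest:
    action: str
    origin: Content  # the content whose text led the model to request this


def may_authorize(request: ActionRequest) -> bool:
    """Reading never authorizes acting: a request that originates from
    untrusted content is refused, whatever that content says."""
    return request.origin.trust is Trust.TRUSTED


# Usage: an email that asks for a payment cannot authorize one.
email = Content("Please wire the invoice amount today.", source="inbound_email")
print(may_authorize(ActionRequest("trigger_payment", origin=email)))
```

The design choice is that the decision depends on where the request came from, not on how convincing its wording is. Refused requests can still be routed to a human reviewer instead of being dropped.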
Layer 3: Human-in-the-loop for consequential actions
Automation is the point of AI, but not every action deserves equal automation. Gate the small set of operations where a mistake is expensive or irreversible.
- Require explicit human approval for moving money, deleting or exporting data, external communication, and code execution.
- Make the approval meaningful. Show the human what will happen in plain terms so they can actually catch an anomaly, rather than presenting a rubber-stamp dialog they click through.
- This also aligns with the EU AI Act. Meaningful human oversight is both a security control and a compliance posture for higher-risk uses.
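An approval gate for the consequential few might look like the sketch below. The action names in `CONSEQUENTIAL`, the `ProposedAction` type, and both functions are hypothetical; the approver is passed in as a callable so that a console prompt, a chat message, or a ticketing workflow can be swapped in.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical labels for the operations listed above.
CONSEQUENTIAL = frozenset(
    {"move_money", "delete_data", "export_data", "send_external", "execute_code"}
)


@dataclass(frozen=True)
class ProposedAction:
    kind: str
    summary: str  # plain-language description shown to the approver


Approver = Callable[[ProposedAction], bool]


def run_action(action: ProposedAction, approver: Approver) -> str:
    """Execute low-risk actions directly; ask a human for the rest."""
    if action.kind in CONSEQUENTIAL and not approver(action):
        return "rejected"
    # A real deployment would perform the action here.
    return "executed"


def console_approver(action: ProposedAction) -> bool:
    """Show what will happen in plain terms and require a typed 'yes'."""
    answer = input(f"The assistant wants to: {action.summary}\nType 'yes' to approve: ")
    return answer.strip().lower() == "yes"
```

For example, `run_action(ProposedAction("move_money", "Pay 4,200 EUR to a new supplier account"), console_approver)` only proceeds if the reviewer types `yes` after reading the summary. Requiring a typed answer is one small way to make the approval harder to click through.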
Layer 4: Input and output handling
Filtering is a layer, not a wall — but layers add up.
- Screen inputs for known injection patterns, while accepting that determined attackers evolve past filters. Treat it as a speed bump.
- Constrain and validate outputs. If the model’s job is to return a category, a number, or a structured object, enforce that shape — don’t pass free-form output straight into a sensitive system.
- Isolate generated code and commands. Never send model-generated code to an execution path without sandboxing and review. That single gap is what made the Vanna AI RCE possible.
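The first two bullets can be sketched with the standard library alone. The pattern list and category names below are made-up examples, not a vetted ruleset, and sandboxing generated code is a separate concern that this snippet does not attempt.

```python
import json
import re

# Illustrative patterns only; real attackers will phrase things differently,
# which is why this check is a speed bump and not a wall.
SUSPICIOUS_PATTERNS = [
    re.compile(r"ignore (all )?(previous|prior) instructions", re.IGNORECASE),
    re.compile(r"disregard (the )?system prompt", re.IGNORECASE),
]

# Hypothetical set of categories the model is allowed to return.
ALLOWED_CATEGORIES = frozenset({"billing", "technical", "account", "other"})


def looks_suspicious(text: str) -> bool:
    """Flag input that matches a known injection phrase."""
    return any(pattern.search(text) for pattern in SUSPICIOUS_PATTERNS)


def parse_category(raw_output: str) -> str:
    """Accept only a JSON object of the form {"category": <allowed value>}.

    Anything else (extra keys, free text, unknown values) is rejected
    before it can reach a downstream system.
    """
    try:
        data = json.loads(raw_output)
    except json.JSONDecodeError as exc:
        raise ValueError("output is not valid JSON") from exc
    if not isinstance(data, dict) or set(data) != {"category"}:
        raise ValueError("output must be an object with exactly one key: category")
    category = data["category"]
    if not isinstance(category, str) or category not in ALLOWED_CATEGORIES:
        raise ValueError(f"unexpected category: {category!r}")
    return category
```

Rejecting extra keys is deliberate: if an injection persuades the model to append an instruction next to a valid category, the whole output fails validation instead of being partially trusted.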
Layer 5: Monitoring and detection
Assume some attempts will land, and make sure you see them.
- Log prompts, tool calls, and outputs so a hijack is reconstructable after the fact.
- Alert on anomalies — unexpected tool calls, outbound data, or actions that don’t match the user’s request. Our piece on spotting a compromised assistant details the signals.
- Review logs regularly. Detection only works if someone looks.
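As a rough sketch of the logging and alerting bullets, the snippet below writes one structured record per interaction and flags tool calls that do not fit the request. The `EXPECTED_TOOLS` mapping, the request types, and the function names are hypothetical, and a real deployment would also need to decide how much prompt text is safe to retain.

```python
import json
import logging
from datetime import datetime, timezone

logger = logging.getLogger("assistant.audit")

# Hypothetical mapping from request type to the tools it normally needs.
EXPECTED_TOOLS = {
    "summarize_document": {"read_document"},
    "answer_question": {"search_knowledge_base"},
}


def audit_record(request_type: str, prompt: str, tool_calls: list, output: str) -> dict:
    """Build a record that lets a hijack be reconstructed after the fact."""
    expected = EXPECTED_TOOLS.get(request_type, set())
    return {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "request_type": request_type,
        "prompt": prompt,
        "tool_calls": list(tool_calls),
        "output": output,
        # Tools that were called but do not match the user's request.
        "unexpected_tools": sorted(set(tool_calls) - expected),
    }


def log_interaction(request_type: str, prompt: str, tool_calls: list, output: str) -> dict:
    """Log every interaction; escalate to a warning when something looks off."""
    record = audit_record(request_type, prompt, tool_calls, output)
    line = json.dumps(record)
    if record["unexpected_tools"]:
        logger.warning(line)  # hook alerting onto WARNING-level records
    else:
        logger.info(line)
    return record
```

A summarization request that ends up calling `send_email` would produce a warning-level record with `unexpected_tools` set to `["send_email"]`, which is the kind of signal worth alerting on.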
Action plan
- Run a capability audit. List every action and integration each assistant has. This is your attack surface.
- Apply least privilege. Revoke any capability not strictly needed; scope every credential to the narrowest role.
- Draw your trust boundaries. Label trusted vs. untrusted content sources, and ensure reading untrusted content can never authorize an action.
- Gate the dangerous few. Put human approval in front of payments, deletions, external sends, and code execution.
- Sandbox generated code and validate structured outputs before they touch a real system.
- Turn on logging and alerting, then actually review the logs and alerts on a schedule.
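The first two steps of the plan can start as a simple comparison between what each assistant has been granted and what its job needs. The inventories below are invented for illustration; in practice they would come from your own integration and credential records.

```python
# Hypothetical inventory: what each assistant can do today.
GRANTED = {
    "support_bot": {"search_knowledge_base", "update_crm_record", "send_email"},
    "summarizer": {"read_document"},
}

# Hypothetical minimum: what each assistant needs for its job.
REQUIRED = {
    "support_bot": {"search_knowledge_base"},
    "summarizer": {"read_document"},
}


def excess_capabilities(granted: dict, required: dict) -> dict:
    """Return, per assistant, the capabilities that exceed its stated need."""
    report = {}
    for assistant, capabilities in granted.items():
        extra = capabilities - required.get(assistant, set())
        if extra:
            report[assistant] = sorted(extra)
    return report


print(excess_capabilities(GRANTED, REQUIRED))
# prints {'support_bot': ['send_email', 'update_crm_record']}
```

Each entry in the report is a candidate for revocation, or for a documented decision about why the capability stays.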
A model that can only suggest is low-risk no matter how badly it’s injected. A model that can act needs every layer above. Start with least privilege — it’s the control that turns the most attacks into nothing.
Want a checklist version to bring to your team? Grab our free Deepfake & AI-attack red-flags checklist and adapt these layers into a one-page review for your own deployments.