The short answer
A successful prompt injection rarely announces itself. The assistant keeps replying in its normal, helpful voice while quietly doing something it was never asked to do. So you detect a compromised assistant by its behaviour, not its words — by watching what it does (tool calls, data movement, actions) against what the user actually requested.
This is the detection counterpart to our defense playbook. Prevention reduces the odds; detection catches what gets through. You need both.
The signals that something is wrong
Treat any of these as a reason to investigate.
1. Actions that don’t match the request. The user asked for a summary; the assistant tried to send an email, modify a record, or call an external service. A mismatch between intent and action is the clearest tell. In the EchoLeak case against Microsoft 365 Copilot, an ordinary user question triggered data exfiltration the user never asked for — the action diverged from the request.
2. Data heading somewhere unfamiliar. Watch for outbound requests to domains you don’t recognize, content being placed into URLs, or unusually large responses. Exfiltration often disguises itself as a normal-looking fetch.
3. Hidden or encoded content in outputs. Injections frequently try to smuggle data out through things the user won’t notice — a markdown image whose URL contains stolen text, base64 blobs, invisible characters, or links pointing at attacker infrastructure. If your assistant’s output contains links or images you didn’t expect, inspect where they point.
4. New or altered persistent state. If your assistant has memory, check it. The ChatGPT “SpAIware” research showed an injection planting false long-term memories that kept exfiltrating data across future sessions. Memories or saved instructions nobody deliberately set are a red flag.
5. Behaviour that drifts after reading external content. If an assistant behaves normally until it ingests a particular email, document, or web page — and then starts acting oddly — that content is your prime suspect for an indirect injection.
Under the hood: how data actually leaves
Exfiltration through a chatbot is rarely a dramatic data dump. The classic technique is to get the model to embed stolen information inside a request that fires automatically — most famously a markdown image tag, where simply rendering the reply causes the client to fetch a URL the attacker controls, with the secret encoded in the path. No click required. This is why the rendering surface (your chat UI, your email client, your IDE) is part of the attack surface, and why restricting which destinations the assistant can reach matters as much as filtering its text.
Build monitoring that catches it
Detection only works if you’ve instrumented for it before the incident.
- Log every tool call and action, with inputs and outputs, tied to the originating user request. This is the single most valuable record you can keep.
- Monitor outbound traffic and alert on requests to destinations outside an allowlist. Most exfiltration needs to reach the open internet — constrain that and you constrain the damage.
- Plant canary tokens in sensitive data and contexts. If a canary ever shows up in an outbound request or an unexpected log, a hidden data path just fired.
- Validate and inspect outputs for unexpected links, images, or encoded blobs before they’re rendered or forwarded.
- Diff persistent state — memories, saved settings, agent instructions — against a known-good baseline on a schedule.
The transparency angle
Detection inside your own systems is one half of the picture. The other is the broader move toward making AI activity visible and attributable — which is exactly where AI transparency regulation is heading.
The EU AI Act’s Article 50 sets transparency obligations: users should know when they’re dealing with AI, and certain AI-generated content must be marked. Content-provenance standards like C2PA and watermarking aim to make the origin of media verifiable. The connecting thread with security is the same instinct: knowing what an AI did, and being able to prove it. An organization that logs, attributes, and can reconstruct its AI’s actions is better positioned for both incident response and compliance.
This is where the defensive and the operational sides of AI meet. Teams that deploy AI to do work for the business — not just publish about its risks — have to bake this visibility in from the start; it’s a theme our sister publication managerAI works on from the implementation side. Detection isn’t a bolt-on; it’s a design property.
Action plan
- Instrument before you need it. Turn on logging of tool calls, actions, and outbound requests, linked to the triggering user request.
- Allowlist outbound destinations. Restrict where your assistant can send or fetch data, and alert on anything off-list.
- Plant canary tokens in sensitive contexts so silent exfiltration trips an alarm.
- Inspect outputs for unexpected links, images, and encoded content before rendering or forwarding.
- Baseline and diff persistent state so planted memories or altered instructions get caught.
- Review on a schedule. Detection is worthless if no one looks at the alerts.
A compromised assistant will keep talking like nothing happened. The only reliable way to catch it is to watch what it does — and to have decided, before the incident, exactly what “normal” looks like.