How would I even know my AI assistant was manipulated?

By its behaviour, not its words. Watch for actions that don't match the user's request: unexpected tool calls, data being sent to unfamiliar destinations, outputs containing hidden links or encoded content, or new persistent 'memories' nobody set. A compromised assistant usually keeps sounding normal while doing something it wasn't asked to.

Can I detect prompt injection just by reading the AI's reply?

Often not. The dangerous part of an indirect injection is what the assistant does behind the scenes — a tool call, a data fetch, an outbound request — which may never appear in the visible reply. That's why logging tool calls and outbound traffic matters more than reading the chat transcript alone.

What is a canary token and how does it help?

A canary token is a unique, trackable marker you plant where only an attacker would reach it. If it ever appears in an outbound request or a log, you know that data path was triggered. It's a cheap way to detect silent exfiltration that would otherwise leave no obvious trace.

How does AI transparency regulation relate to detection?

The EU AI Act's Article 50 transparency duties and content-provenance standards like C2PA make AI activity more visible and attributable. The same instinct that drives detection — knowing what an AI did and being able to prove it — underpins both security monitoring and regulatory compliance.

How to Tell if Your AI Assistant Has Been Compromised

The short answer

A successful prompt injection rarely announces itself. The assistant keeps replying in its normal, helpful voice while quietly doing something it was never asked to do. So you detect a compromised assistant by its behaviour, not its words — by watching what it does (tool calls, data movement, actions) against what the user actually requested.

This is the detection counterpart to our defense playbook. Prevention reduces the odds; detection catches what gets through. You need both.

The signals that something is wrong

Treat any of these as a reason to investigate.

1. Actions that don’t match the request. The user asked for a summary; the assistant tried to send an email, modify a record, or call an external service. A mismatch between intent and action is the clearest tell. In the EchoLeak attack on Microsoft 365 Copilot, an ordinary user question triggered data exfiltration the user never asked for: the action diverged from the request.

2. Data heading somewhere unfamiliar. Watch for outbound requests to domains you don’t recognize, conversation or document content being placed into URL paths or query strings, or unusually large responses. Exfiltration often disguises itself as a normal-looking fetch.

3. Hidden or encoded content in outputs. Injections frequently try to smuggle data out through things the user won’t notice — a markdown image whose URL contains stolen text, base64 blobs, invisible characters, or links pointing at attacker infrastructure. If your assistant’s output contains links or images you didn’t expect, inspect where they point.

4. New or altered persistent state. If your assistant has memory, check it. The ChatGPT “SpAIware” research showed an injection planting false long-term memories that kept exfiltrating data across future sessions. Memories or saved instructions nobody deliberately set are a red flag.

5. Behaviour that drifts after reading external content. If an assistant behaves normally until it ingests a particular email, document, or web page — and then starts acting oddly — that content is your prime suspect for an indirect injection.
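The first signal, a mismatch between request and action, can be checked mechanically once each kind of request has a declared set of tools it is expected to use. The following is a minimal sketch, not a definitive implementation. The log record shape, the request types, and the tool names (`read_document`, `send_email`, and so on) are all hypothetical and would need to be mapped onto whatever your assistant platform actually records.

```python
from dataclasses import dataclass

# Hypothetical mapping from request type to the tools that request may use.
# These names are illustrative assumptions, not a real product's API.
EXPECTED_TOOLS = {
    "summarize": {"read_document"},
    "schedule_meeting": {"read_calendar", "create_event"},
}


@dataclass(frozen=True)
class ToolCall:
    request_id: str    # ties the call back to the originating user request
    request_type: str  # what the user asked for, e.g. "summarize"
    tool: str          # what the assistant actually invoked


def unexpected_calls(calls):
    """Return the tool calls that fall outside what the request type allows."""
    flagged = []
    for call in calls:
        # Unknown request types get an empty set, so every call is flagged.
        allowed = EXPECTED_TOOLS.get(call.request_type, set())
        if call.tool not in allowed:
            flagged.append(call)
    return flagged
```

Given a summary request that produced both a `read_document` call and a `send_email` call, only the `send_email` call would be flagged. Treating unknown request types as having no allowed tools is a deliberate fail-closed choice: it produces more alerts, but it avoids silently trusting anything that was never classified.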

Under the hood: how data actually leaves

Exfiltration through a chatbot is rarely a dramatic data dump. The classic technique is to get the model to embed stolen information inside a request that fires automatically. The best-known example is a markdown image tag: simply rendering the reply causes the client to fetch a URL the attacker controls, with the secret encoded in the path or query string. No click required. This is why the rendering surface (your chat UI, your email client, your IDE) is part of the attack surface, and why restricting which destinations the assistant can reach matters as much as filtering its text.
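The markdown-image technique described above suggests a simple check to run on a reply before it is rendered: pull out every URL in markdown image or link syntax and compare its host against an allowlist. The sketch below assumes a hypothetical allowlist (`ALLOWED_HOSTS`) and an arbitrary length threshold. It only covers inline markdown syntax, so HTML `<img>` tags and reference-style links would need separate handling.

```python
import re
from urllib.parse import urlparse

# Hypothetical allowlist of hosts the chat UI may load content from.
ALLOWED_HOSTS = {"docs.example.com", "cdn.example.com"}

# Matches inline markdown images ![alt](url) and links [text](url);
# the capture group is the URL itself.
MARKDOWN_URL = re.compile(r"!?\[[^\]]*\]\(\s*(https?://[^\s)]+)")


def suspicious_urls(reply: str, max_url_length: int = 200):
    """Return (url, reason) pairs for URLs in a reply that deserve a closer look."""
    findings = []
    for url in MARKDOWN_URL.findall(reply):
        host = urlparse(url).hostname or ""
        if host not in ALLOWED_HOSTS:
            findings.append((url, "host not on allowlist"))
        elif len(url) > max_url_length:
            # Even an allowed host is worth a look if the URL carries a lot of data.
            findings.append((url, "unusually long URL"))
    return findings
```

Running this before rendering, rather than after, is the point: once the client has fetched the image, the data has already left.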

Build monitoring that catches it

Detection only works if you’ve instrumented for it before the incident.

Log every tool call and action, with inputs and outputs, tied to the originating user request. This is the single most valuable record you can keep.
Monitor outbound traffic and alert on requests to destinations outside an allowlist. Most exfiltration needs to reach the open internet — constrain that and you constrain the damage.
Plant canary tokens in sensitive data and contexts. If a canary ever shows up in an outbound request or an unexpected log, a hidden data path just fired.
Validate and inspect outputs for unexpected links, images, or encoded blobs before they’re rendered or forwarded.
Diff persistent state — memories, saved settings, agent instructions — against a known-good baseline on a schedule.
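Of the measures above, the canary token is the easiest to prototype. A minimal sketch follows, assuming you can both plant a marker in sensitive data and scan outbound requests or logs as text. The function names and the `cnry` prefix are illustrative, not part of any canary-token product.

```python
import secrets


def make_canary(prefix: str = "cnry") -> str:
    """Create a unique random marker that has no legitimate reason to leave the system."""
    # 8 random bytes rendered as 16 hex characters; hex and hyphens pass
    # through URL encoding unchanged, which keeps the marker searchable.
    return f"{prefix}-{secrets.token_hex(8)}"


def find_canaries(outbound_text: str, canaries):
    """Return every planted canary that appears in an outbound URL, body, or log line."""
    return [c for c in canaries if c in outbound_text]
```

A plain substring match like this will miss a canary that the attacker's payload re-encodes (for example as base64) before sending, so it complements the outbound allowlist rather than replacing it.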

The transparency angle

Detection inside your own systems is one half of the picture. The other is the broader move toward making AI activity visible and attributable — which is exactly where AI transparency regulation is heading.

The EU AI Act’s Article 50 sets transparency obligations: users should know when they’re dealing with AI, and certain AI-generated content must be marked. Content-provenance standards like C2PA, along with watermarking techniques, aim to make the origin of media verifiable. The connecting thread with security is the same instinct: knowing what an AI did, and being able to prove it. An organization that logs, attributes, and can reconstruct its AI’s actions is better positioned for both incident response and compliance.

This is where the defensive and the operational sides of AI meet. Teams that deploy AI to do work for the business — not just publish about its risks — have to bake this visibility in from the start; it’s a theme our sister publication managerAI works on from the implementation side. Detection isn’t a bolt-on; it’s a design property.

Action plan

Instrument before you need it. Turn on logging of tool calls, actions, and outbound requests, linked to the triggering user request.
Allowlist outbound destinations. Restrict where your assistant can send or fetch data, and alert on anything off-list.
Plant canary tokens in sensitive contexts so silent exfiltration trips an alarm.
Inspect outputs for unexpected links, images, and encoded content before rendering or forwarding.
Baseline and diff persistent state so planted memories or altered instructions get caught.
Review on a schedule. Detection is worthless if no one looks at the alerts.
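The baseline-and-diff step can be sketched in a few lines, assuming persistent state (memories, saved instructions) can be exported as a dictionary of entries keyed by ID. That export format is an assumption for illustration; real assistants expose memory in different ways, if at all.

```python
import hashlib
import json


def fingerprint(entry: dict) -> str:
    """Stable hash of one entry (keys are sorted so field order does not matter)."""
    canonical = json.dumps(entry, sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def diff_state(baseline: dict, current: dict) -> dict:
    """Compare two {entry_id: entry} snapshots of persistent state."""
    added = sorted(k for k in current if k not in baseline)
    removed = sorted(k for k in baseline if k not in current)
    changed = sorted(
        k for k in current
        if k in baseline and fingerprint(current[k]) != fingerprint(baseline[k])
    )
    return {"added": added, "removed": removed, "changed": changed}
```

Any entry under `added` or `changed` that no one deliberately set is the kind of planted memory described earlier and should be reviewed before the baseline is updated.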

A compromised assistant will keep talking like nothing happened. The only reliable way to catch it is to watch what it does — and to have decided, before the incident, exactly what “normal” looks like.