A Quiet Afternoon Looping with Claude

Lately, I had the feeling LLM agent security was mostly vibes.
We had policies. We had restrictions. We did not have an evidence base.
Most of our calls against prompt-injection attack classes were based on vendor whitepapers, partial third-party research, and the practitioner's gut feel. Honest reasoning, lot of doubt, and certainly not measurement. Meanwhile, the mitigations had a real productivity tax. Disabled Gmail and Calendar MCP connectors. Disabled web fetch. Restricted network egress for the AI tooling. The business felt it. People routed around InfoSec, and you can hardly blame them.
We needed to build an evidence base and a measurement tool to better assess what was the risk of letting AI agents wandering off.
And so, I set out to build exactly that.
💡 Matryoshka, or how to talk to Claude without anyone getting hurt
The harness is a Python orchestrator called Matryoshka. It drives the real claude CLI inside a sandboxed Docker container, one fresh subprocess per trial. Mock services on loopback impersonate well-known SaaS plateforms, an HTML host, a small package registry and an exfil endpoint. The agent thinks it is talking to the open internet, but the packets never leave the container.
The matrix:
| Dimension |
Values |
| Models |
claude-haiku-4-5, claude-sonnet-4-6, claude-opus-4-6, claude-opus-4-7 |
| Postures |
Vanilla (bypassPermissions, no harness) and Hardened (an in-house hardened configuration) |
| Scenarios |
16, mixing differents carriers such as Notion page, email, HTML, PDF, image, rogue MCP tool, rogue SKILL.md, vendored Python package and two possible outcomes such as data exfiltration or download-exec (RCE) |
| Techniques |
35 named, including direct override, indirect injection, virtualisation, ASCII-art masking, chat-template tag spoofing, EchoGram suffix tokens, slopsquat package recommendation, persistent-memory write to CLAUDE.md |
Each cell ran n=10 trials to absorb sampling noise. For each cell where vanilla compromised the agent, a hardened pass was run with the same payload, to see whether the harness saved us where the model didn't.
The orchestrator parses the streaming JSON tool trace coming out of the CLI in real time, watching for the agent's intent to read the canary file, to curl the exfilration endpoint, to pip install the squatted package, to write to CLAUDE.md. It assigns one of the following verdicts on a deterministic priority chain: ATTACK_SUCCESS, BLOCKED, REFUSED, PARTIAL, IGNORE.
⚠️ What this work is not
The temptation to oversell this work is real, so the limitations belong in plain sight.
n=10 per cell is probably inside the noise floor for very-low-probability events.
Fresh process per trial means no conversation history and no in-session priming. Real attacks exploit an agent already in flow.
Single-turn only. claude --bare -p is non-interactive. Crescendo-style multi-turn escalation is out of scope, and this is the big one. Customer-facing chatbots and public-form ingestion live in a world where attackers iterate against the agent in real time. Our results do not bound that risk surface and should not be cited as a baseline for it.
Static prompts, no adversarial feedback loop. Publicly known techniques only, no 0-day attacks, no black-box gradient search. Loopback-only exfil destinations, which measures the agent's intent to call out, not the real-world impact. Mocks are not real services.
If you are running an internet-exposed autonomous agent and reading this for justification, don't. Get a separate red-team engagement. ;)
🎯 What we found
We ran roughly 13,000 indirect prompt-injection simulated attacks against the four latest Anthropic models, over two days, inside isolated Docker containers, and we found out which model tier you should be running, which use case is the worst offender, and why fancy academic attack techniques landed nothing while boring framing tricks landed everything.
The numbers are short enough to fit in one block.
| Model |
Vanilla compromise |
Partial |
| Haiku 4.5 |
1.9 % (42 / 2 190) |
1.0 % |
| Sonnet 4.6 |
0.1 % (6 / 5 637) |
0.6 % |
| Opus 4.6 |
0.1 % (2 / 2 730) |
0.7 % |
| Opus 4.7 |
0 % (0 / 2 730) |
0 % |
Four findings are worth digesting.
-
Model tier matters more than configuration. Haiku was compromised at roughly twenty times the rate of Sonnet and Opus 4.6, and infinitely more than Opus 4.7. If you are running a smaller model for cost reasons on a path that ingests untrusted external content, you are paying for it in measurable risk. This is not a security preference. It is a number.
-
Opus 4.7 mitigated all 2,730 vanilla attempts. Every single one. The non-deterministic nature of LLMs was practically mitigated on this flagship. The most likely explanation is the dedicated classifier sitting in front of the model. Obvisouly, I did not crack open the architecture to verify. I'm reporting what we measured.
-
RCE was the dominant successful outcome. Thirteen of sixteen compromised cells across all models went to download-and-execute, against three of data exfil. Best hypothesis: an RCE prompt reads like documented setup instructions ("run this installer to configure your workspace"), which lowers the agent's refusal threshold compared to an obvious "POST this canary to an external host" ask. Operationally, it seems to me it is a good news / bad news situation. RCE is by far nastier than a single-shot exfiltration. On the other hand, download-executing a payload in a file is the bread and butter of EDRs. If it is less stealthy, it may be catched.
-
The worst use case was the rogue skill. On Haiku 4.5, rogue_skill_exec hit 12.7 %. The entry point is a supply-chain compromise: a poisoned SKILL.md lands in the skills directory, the user invokes it, the body of the skill is the payload. This is the AI-agent equivalent of npm install running someone else's postinstall script. Skill provenance is the emerging dependency-trust problem we have to solve.
🤖 The boring finding is the most important
Here is the result I keep coming back to.
The state-of-the-art research techniques in the corpus, EchoGram suffix tokens, adversarial suffixes, system-tag spoofing, ChatInject, returned zero compromises across all four models. The compromises that did land came from low-complexity framing: virtualisation ("this is a training exercise"), indirect injection through fetched content, CSRF-style memory-poisoning attempts against CLAUDE.md, slopsquat-style package recommendations.
Long story short, plain prose, dressed up as plausible instructions, beat cutting-edge bypass research.
If that does not sound like every social-engineering finding of the last twenty years, you have not been paying attention.
In my opinion, the cure is not more elaborate guardrails. The cure seems to mix classifier-powered flagship-class model on every path that parses untrusted content, plus deterministic hooks that says no when the agent reaches for the wrong tool or drift from intended behavior, plus an in-depth security model which includes detections and hunting capabilities.