
Anthropic published the prompt injection failure rates that enterprise security teams have been asking every vendor for

Anthropic’s recent prompt injection testing of Claude Opus 4.6 underscores how much attack success rates depend on the surface under attack. In a constrained coding environment, the attack failed every time: a 0% success rate across 200 attempts, even without safeguards. On a GUI-based surface with extended thinking enabled, the picture changed sharply: a single attempt succeeded 17.8% of the time without safeguards, and by the 200th attempt the success rate climbed to 78.6% without safeguards and 57.1% with them.
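
How much comfort a low single-attempt rate offers depends on how success compounds over repeated attempts. As a rough, back-of-the-envelope illustration (the independence assumption below is mine, not a model Anthropic uses in the system card), the following sketch translates between a per-attempt success rate and a cumulative success rate over many attempts.

```python
# Back-of-the-envelope sketch (not from the system card): assume each attack
# attempt succeeds independently with the same probability, and translate
# between per-attempt and cumulative success rates. Persistent attackers and
# adaptive defenses break this assumption, so treat the outputs as rough.

def cumulative_success(per_attempt_rate: float, attempts: int) -> float:
    """Probability of at least one success across `attempts` independent tries."""
    return 1.0 - (1.0 - per_attempt_rate) ** attempts

def implied_per_attempt_rate(cumulative_rate: float, attempts: int) -> float:
    """Per-attempt rate that would produce `cumulative_rate` over `attempts` tries."""
    return 1.0 - (1.0 - cumulative_rate) ** (1.0 / attempts)

# Figures reported for the GUI surface with extended thinking and no safeguards:
# 17.8% on a single attempt, 78.6% by the 200th attempt.
print(f"{cumulative_success(0.178, 200):.3f}")         # what independence would predict after 200 tries
print(f"{implied_per_attempt_rate(0.786, 200):.4f}")   # per-attempt rate implied by the 78.6% figure
```

Because the independence prediction and a disclosed cumulative figure need not match, per-attempt and multi-attempt numbers are worth requesting separately rather than deriving one from the other.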

Anthropic’s latest system card breaks those attack success rates down by surface, attempt count, and safeguard configuration. That level of granularity is rare in vendor disclosures, and it gives enterprise security teams concrete data to weigh in risk assessments and procurement decisions.
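
One way a security team might put that granularity to work is to record each disclosed figure together with its surface, attempt count, and safeguard configuration so vendors can be compared like-for-like. The schema below is a hypothetical sketch, not a format any of these vendors publishes; the example figures are the ones cited above.

```python
from dataclasses import dataclass

# Hypothetical record format for tracking disclosed attack success rates (ASRs)
# by surface, attempt count, and safeguard configuration. The schema is an
# illustration for internal comparison only; the example figures are the ones
# cited in this article.

@dataclass(frozen=True)
class ASRDisclosure:
    vendor: str
    model: str
    surface: str         # e.g. "constrained coding environment", "GUI + extended thinking"
    attempts: int        # attempt count the rate refers to
    safeguards: bool     # safeguard configuration (on/off)
    success_rate: float  # attack success rate, 0.0 to 1.0

disclosures = [
    ASRDisclosure("Anthropic", "Claude Opus 4.6", "constrained coding environment", 200, False, 0.000),
    ASRDisclosure("Anthropic", "Claude Opus 4.6", "GUI + extended thinking", 1, False, 0.178),
    ASRDisclosure("Anthropic", "Claude Opus 4.6", "GUI + extended thinking", 200, False, 0.786),
    ASRDisclosure("Anthropic", "Claude Opus 4.6", "GUI + extended thinking", 200, True, 0.571),
]

def worst_case(records: list[ASRDisclosure], surface_keyword: str) -> ASRDisclosure:
    """Return the highest disclosed ASR among surfaces matching the keyword."""
    matching = [r for r in records if surface_keyword.lower() in r.surface.lower()]
    return max(matching, key=lambda r: r.success_rate)

print(worst_case(disclosures, "GUI"))  # the figure to plan around for that surface
```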

A comparison of the disclosures from Anthropic, OpenAI, and Google shows that the level of detail varies significantly by developer. Anthropic’s disclosure includes per-surface attack success rates, attack persistence scaling data, safeguard on/off comparisons, agent monitoring evasion data, and zero-day discovery counts. OpenAI and Google, by contrast, provide more limited information, focusing on benchmark scores and relative improvements rather than specific attack success rates or persistence scaling data.

One of the key takeaways from Anthropic’s findings is how quickly the threat landscape around AI agents is evolving. As models become more capable, the risks associated with prompt injection and related vulnerabilities grow, and security teams must adapt their mitigation strategies accordingly.

Real-world attacks highlight the urgent need for robust security measures. In a recently disclosed vulnerability in Anthropic’s Claude Cowork system, PromptArmor researchers exploited a hidden prompt injection to steal confidential user files, demonstrating that these weaknesses are not merely theoretical.

The evaluation integrity problem, highlighted by Anthropic’s use of Opus 4.6 to debug its own evaluation infrastructure, underscores the need for independent evaluation and red team testing. Security leaders should insist that AI agent vendors provide transparent, verifiable security data to support their decision-making.

In light of these findings, security leaders should take concrete steps before their next vendor evaluation: request per-surface attack success rates, commission independent red team evaluations, and validate vendors’ agent security claims against independent results.
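
For the last of those steps, one lightweight way to check a vendor’s claim against independent testing is to treat the red team results as a binomial sample and ask whether the claimed rate falls inside a confidence interval. The helper below is an illustrative sketch under that assumption, not an established audit procedure, and the example numbers are invented.

```python
import math

# Illustrative sketch: treat an independent red team's results as a binomial
# sample and check whether a vendor's claimed attack success rate is
# statistically consistent with them. The 95% Wilson interval and the example
# numbers are assumptions for illustration, not an established audit standard.

def wilson_interval(successes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """95% Wilson score interval for a binomial proportion."""
    if trials == 0:
        return (0.0, 1.0)
    p = successes / trials
    denom = 1.0 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    margin = (z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials))) / denom
    return (center - margin, center + margin)

def claim_is_consistent(claimed_rate: float, successes: int, trials: int) -> bool:
    """True if the claimed rate falls inside the interval implied by independent testing."""
    low, high = wilson_interval(successes, trials)
    return low <= claimed_rate <= high

# Example: a vendor claims a 5% attack success rate on a given surface, but an
# independent red team observed 23 successful injections in 200 attempts.
print(wilson_interval(23, 200))
print(claim_is_consistent(0.05, 23, 200))  # False: the claim is not supported by the test
```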
