
Claudini: Autoresearch Discovers State-of-the-Art Adversarial Attack Algorithms for LLMs

Alexander Panfilov, Peter Romov, Igor Shilov, Yves-Alexandre de Montjoye, Jonas Geiping, Maksym Andriushchenko

Abstract

LLM agents like Claude Code can not only write code but can also be used for autonomous AI research and engineering \citep{rank2026posttrainbench, novikov2025alphaevolve}. We show that an \emph{autoresearch}-style pipeline \citep{karpathy2026autoresearch} powered by Claude Code discovers novel white-box adversarial attack \textit{algorithms} that \textbf{significantly outperform all existing (30+) methods} in jailbreaking and prompt injection evaluations. Starting from existing attack implementations, such as GCG~\citep{zou2023universal}, the agent iterates to produce new algorithms achieving up to 40\% attack success rate on CBRN queries against GPT-OSS-Safeguard-20B, compared to $\leq$10\% for existing algorithms (\Cref{fig:teaser}, left). The discovered algorithms generalize: attacks optimized on surrogate models transfer directly to held-out models, achieving \textbf{100\% ASR against Meta-SecAlign-70B} \citep{chen2025secalign} versus 56\% for the best baseline (\Cref{fig:teaser}, middle). Extending the findings of~\cite{carlini2025autoadvexbench}, our results are an early demonstration that incremental safety and security research can be automated using LLM agents. White-box adversarial red-teaming is particularly well-suited for this: existing methods provide strong starting points, and the optimization objective yields dense, quantitative feedback. We release all discovered attacks alongside baseline implementations and evaluation code at https://github.com/romovpa/claudini.
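
As a concrete illustration of the kind of seed attack the agent starts from, below is a minimal sketch of one GCG-style optimization step in the spirit of \citep{zou2023universal}. It assumes a Hugging Face causal LM; the function name, signature, and defaults are illustrative, not the released implementation.

```python
# Minimal sketch of one GCG-style step (illustrative names only; see the
# repository for the actual attack implementations).
import torch
import torch.nn.functional as F

def gcg_step(model, embed_matrix, prompt_ids, suffix_ids, target_ids,
             top_k=256, num_candidates=128):
    """Propose and score single-token swaps in the adversarial suffix,
    keeping the candidate with the lowest token-forcing loss."""
    # One-hot encode the suffix so the token choice is differentiable.
    one_hot = torch.zeros(len(suffix_ids), embed_matrix.size(0),
                          device=embed_matrix.device, dtype=embed_matrix.dtype)
    one_hot.scatter_(1, suffix_ids.unsqueeze(1), 1.0)
    one_hot.requires_grad_(True)

    embeds = torch.cat([embed_matrix[prompt_ids],
                        one_hot @ embed_matrix,       # suffix embeddings
                        embed_matrix[target_ids]]).unsqueeze(0)
    logits = model(inputs_embeds=embeds).logits

    # Cross-entropy of the forced target tokens (the token-forcing objective).
    t0 = len(prompt_ids) + len(suffix_ids) - 1
    loss = F.cross_entropy(logits[0, t0:t0 + len(target_ids)], target_ids)
    loss.backward()

    # Top-k most promising replacements per position (linearized loss drop).
    top_tokens = (-one_hot.grad).topk(top_k, dim=1).indices

    best_ids, best_loss = suffix_ids, float("inf")
    with torch.no_grad():
        for _ in range(num_candidates):
            pos = torch.randint(len(suffix_ids), (1,)).item()
            cand = suffix_ids.clone()
            cand[pos] = top_tokens[pos, torch.randint(top_k, (1,)).item()]
            ids = torch.cat([prompt_ids, cand, target_ids]).unsqueeze(0)
            out = model(ids).logits
            cand_loss = F.cross_entropy(out[0, t0:t0 + len(target_ids)],
                                        target_ids).item()
            if cand_loss < best_loss:
                best_ids, best_loss = cand, cand_loss
    return best_ids, best_loss
```

In practice, `embed_matrix` is `model.get_input_embeddings().weight`, and the step is repeated until the target loss plateaus. The dense, quantitative loss returned by each step is precisely the feedback signal that makes this setting well-suited to agent-driven iteration.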


Figures (7)

  • Figure 1: Claudini Discovers Effective Attack Algorithms. Left: Claude directly improves attack algorithms when targeting a single model: autoresearch against GPT-OSS-Safeguard-20B yields attacks that outperform existing methods on held-out ClearHarm CBRN queries. Middle: Claude finds generalizable and transferable attack algorithms: methods discovered on unrelated models (Qwen-2.5-7B, Llama-2-7B, Gemma-7B), on a token-forcing task with randomly sampled targets, transfer to the prompt injection setting against Meta-SecAlign-70B \citep{chen2025secalign}. Right: Claude outperforms all baselines on held-out random targets: on the random token target task, aggregated over five models, Claude-devised attacks outperform existing methods and their Optuna-tuned counterparts.
  • Figure 2: Claudini Strongly Outperforms a Classical AutoML Method. Optuna (teal): best loss found by a Bayesian hyperparameter search across 25 methods (100 trials each); the best result across all methods is highlighted. Claude (orange): best loss achieved by Claude-designed optimizer variants (100 trials). Vertical ticks at trials 19 and 64 mark where we switched the target model during the autoresearch run. Claude's methods consistently outperform the Optuna-tuned baselines, reaching $10\times$ lower loss by version 82. A minimal sketch of the Optuna baseline follows this figure list.
  • Figure 3: Claudini Pipeline. The Claude Code agent iteratively designs, implements, and evaluates new token-forcing attacks. It is seeded with a collection of existing attacks and their results (losses) on reference models. All produced methods are evaluated on held-out targets and, where applicable, held-out models, and placed on a leaderboard. We define a single experiment as a method implemented and evaluated on a set of targets under a given FLOPs and input-token budget. A minimal sketch of this loop also follows the list.
  • Figure 4: Attack success rate on GPT-OSS-Safeguard-20B, evaluated on 40 held-out ClearHarm CBRN queries. The best Claude methods progressively improve over the autoresearch run. We provide pseudocode for claude_v53-oss in \Cref{app:claude_methods}.
  • Figure 5: Attack Success Rates on Meta-SecAlign. Prompt injection attack success rates on 50 held-out AlpacaFarm instructions, where the attacker forces the model to output "Hacked" via an optimized suffix in the untrusted input role. We evaluate with a $10^{17}$ FLOPs budget on the 8B model and a $10^{18}$ FLOPs budget on the 70B model. Claudini-designed methods outperform all baselines, including Optuna-tuned variants, on both model scales, achieving perfect (100\%) ASR on Meta-SecAlign-70B. We provide pseudocode for claude_v63 in \Cref{app:claude_methods}.
  • ...and 2 more figures
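
As referenced in the Figure 2 caption, here is a minimal sketch of the Optuna baseline: a TPE (Bayesian) search minimizing the attack loss for a single method. The hyperparameter names, their ranges, and the `run_attack` helper are illustrative placeholders, not the paper's actual search space.

```python
# Minimal sketch of the Optuna baseline from Figure 2: a TPE (Bayesian)
# search minimizing the attack loss for one method. Hyperparameter names
# and run_attack are illustrative placeholders.
import optuna

def run_attack(config: dict) -> float:
    # Placeholder: substitute a real attack run that returns its final
    # token-forcing loss; a synthetic value keeps the sketch runnable.
    return abs(config["top_k"] - 256) / 256 + config["suffix_len"] / 100

def objective(trial: optuna.Trial) -> float:
    config = {
        "top_k": trial.suggest_int("top_k", 16, 512, log=True),
        "num_candidates": trial.suggest_int("num_candidates", 32, 512, log=True),
        "suffix_len": trial.suggest_int("suffix_len", 10, 60),
    }
    return run_attack(config)  # lower loss is better

study = optuna.create_study(direction="minimize",
                            sampler=optuna.samplers.TPESampler(seed=0))
study.optimize(objective, n_trials=100)  # 100 trials per method, as in Fig. 2
print(study.best_params, study.best_value)
```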
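
And, as referenced in the Figure 3 caption, a minimal sketch of the autoresearch loop itself. Every name here (`agent.propose`, `evaluate`, the budget argument) is a hypothetical stand-in for the released pipeline, not its actual API.

```python
# Minimal sketch of the Figure 3 loop. All names are hypothetical
# stand-ins, not the pipeline's API.
from dataclasses import dataclass

@dataclass
class Experiment:
    name: str        # attack variant proposed by the agent
    code: str        # the agent-written implementation
    holdout_loss: float = float("inf")

def autoresearch_loop(agent, evaluate, seed_methods, n_iterations=100):
    """Seed with existing attacks and their losses, then iterate:
    propose -> implement -> evaluate on held-out targets -> rank."""
    leaderboard = list(seed_methods)
    for _ in range(n_iterations):
        # The agent sees the current leaderboard and proposes a new method.
        proposal = agent.propose([(e.name, e.holdout_loss)
                                  for e in leaderboard])
        exp = Experiment(proposal.name, proposal.code)
        # One experiment = one method evaluated on a target set under a
        # fixed FLOPs and input-token budget (as defined in Figure 3).
        exp.holdout_loss = evaluate(exp.code, budget_flops=1e17)
        leaderboard.append(exp)
        leaderboard.sort(key=lambda e: e.holdout_loss)
    return leaderboard
```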