Replicating TEMPEST at Scale: Multi-Turn Adversarial Attacks Against Trillion-Parameter Frontier Models

Richard Young

TL;DR

This study replicates TEMPEST across 10 frontier models from 8 vendors to quantify vulnerability to adaptive multi-turn adversarial attacks on 1,000 harmful behaviors. It shows that safety alignment quality varies widely across vendors, that model scale does not predict robustness, and that extended reasoning (thinking mode) substantially reduces attack success but does not eliminate it. The results highlight the persistent risk of multi-turn jailbreaks and point to deliberative inference as a viable safety enhancement, while underscoring the need for cross-vendor defense development that does not rely on scaling. Overall, the work provides a large-scale benchmark for adversarial robustness and offers actionable directions for improving safety alignment and defense strategies in industry-scale systems.

Abstract

Despite substantial investment in safety alignment, the vulnerability of large language models to sophisticated multi-turn adversarial attacks remains poorly characterized, and whether model scale or inference mode affects robustness is unknown. This study used the TEMPEST multi-turn attack framework to evaluate ten frontier models from eight vendors on 1,000 harmful behaviors, issuing over 97,000 API queries across adversarial conversations, with automated evaluation by independent safety classifiers. The results show a spectrum of vulnerability: six models had attack success rates (ASR) of 96% to 100%, while four showed meaningful resistance, with ASR ranging from 42% to 78%; enabling extended reasoning on an identical architecture reduced ASR from 97% to 42%. These findings indicate that safety alignment quality varies substantially across vendors, that model scale does not predict adversarial robustness, and that thinking mode provides a deployable safety enhancement. Collectively, this work establishes that current alignment techniques remain fundamentally vulnerable to adaptive multi-turn attacks regardless of model scale, while identifying deliberative inference as a promising defense direction.

Paper Structure

This paper contains 49 sections, 11 figures, 5 tables.

Figures (11)

  • Figure 1: TEMPEST system architecture. The attacker model generates adaptive attack prompts using the "Siege" chain-of-attack format. Target model responses are evaluated for harm, with feedback informing subsequent attack strategy selection.
  • Figure 2: Multi-branch conversation tree exploration using breadth-first search. Multiple attack strategies are explored in parallel, with low-scoring branches pruned to focus computational resources on promising attack paths.
  • Figure 3: Vulnerability spectrum by model, sorted by parameter count (left to right). Attack success rates range from 42% (Kimi K2 Thinking) to 100% (Gemma3 12B, Mistral Large 3). Red indicates ASR $\geq$ 90%, yellow 50--89%, green $<$ 50%. The absence of a downward trend from left to right indicates no relationship between model scale and adversarial robustness. The dashed line indicates the 97% baseline from Zhou and Arel (2025).
  • Figure 4: Scale vs. safety: model size does not predict adversarial robustness. (a) Total parameters vs. ASR shows no correlation ($r = -0.12$, n.s.) across nine models (excluding thinking mode variant). (b) Active parameters vs. ASR similarly shows no relationship. Filled circles indicate dense architecture (Gemma3 only); open triangles indicate mixture-of-experts. The smallest model (12B Gemma3) and largest MoE models (675B Mistral Large 3, 1T Kimi K2) all achieve $\geq$97% ASR in standard mode. Dotted line indicates Zhou and Arel (2025) baseline.
  • Figure 5: Attack success rates by harm category across models. Heatmap shows ASR (%) for each model-category combination. Highly vulnerable models (Gemma3, Mistral, DeepSeek) show uniform 100% ASR across categories, while more resistant models exhibit category-specific variation (e.g., MiniMax M2: 10% Violence vs. 70% Privacy).
  • ...and 6 more figures
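
The breadth-first exploration with pruning described in the Figure 2 caption can be illustrated with a minimal sketch. This is not the TEMPEST implementation: the names `Branch`, `explore`, `keep`, and `success_threshold` are hypothetical, and `score_fn` is a stand-in for the harm evaluation of target model responses, replaced here by a toy function over strategy labels.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Branch:
    """One path through the conversation tree (hypothetical structure)."""
    turns: List[str] = field(default_factory=list)
    score: float = 0.0


def explore(
    strategies: List[str],
    score_fn: Callable[[List[str]], float],
    max_depth: int,
    keep: int,
    success_threshold: float,
) -> Optional[Branch]:
    """Level-by-level search that expands every surviving branch with every
    strategy, then keeps only the `keep` highest-scoring branches."""
    frontier = [Branch()]
    for _ in range(max_depth):
        # Expand: each surviving branch tries each strategy in parallel.
        children = []
        for branch in frontier:
            for strategy in strategies:
                turns = branch.turns + [strategy]
                children.append(Branch(turns, score_fn(turns)))
        # Rank the new level; stop early if the best branch succeeds.
        children.sort(key=lambda b: b.score, reverse=True)
        if children[0].score >= success_threshold:
            return children[0]
        # Prune: low-scoring branches are dropped before the next level.
        frontier = children[:keep]
    return None


# Toy scoring (assumption): each use of strategy "b" adds 4 points.
def toy_score(turns: List[str]) -> float:
    return 4.0 * turns.count("b")


result = explore(["a", "b"], toy_score, max_depth=3, keep=2,
                 success_threshold=10.0)
```

With the toy scoring above, the search reaches the threshold at depth three along the path that picks "b" every turn. The `keep` parameter controls the trade-off sketched in the caption: a smaller value spends fewer queries per level but risks discarding a branch that would have succeeded later.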