Replicating TEMPEST at Scale: Multi-Turn Adversarial Attacks Against Trillion-Parameter Frontier Models

Richard Young

TL;DR

This study replicates TEMPEST across 10 frontier models from 8 vendors, measuring their vulnerability to adaptive multi-turn adversarial attacks on 1,000 harmful behaviors. Safety alignment quality varies widely across vendors, model scale does not predict robustness, and extended reasoning (thinking mode) substantially reduces attack success without eliminating it: on an identical architecture, the attack success rate fell from 97% to 42%. The results highlight the persistent risk of multi-turn jailbreaks, point to deliberative inference as a viable safety enhancement, and underscore the need for cross-vendor defense development that goes beyond scaling. Overall, the work provides a large-scale benchmark for adversarial robustness and suggests directions for improving safety alignment and defense strategies in industry-scale systems.

Abstract

Despite substantial investment in safety alignment, the vulnerability of large language models to sophisticated multi-turn adversarial attacks remains poorly characterized, and it is unknown whether model scale or inference mode affects robustness. This study used the TEMPEST multi-turn attack framework to evaluate ten frontier models from eight vendors across 1,000 harmful behaviors, generating over 97,000 API queries across adversarial conversations, with automated evaluation by independent safety classifiers. The results show a spectrum of vulnerability: six models reached an attack success rate (ASR) of 96% to 100%, while four showed meaningful resistance, with ASR ranging from 42% to 78%. Enabling extended reasoning on an identical architecture reduced ASR from 97% to 42%. These findings indicate that safety alignment quality varies substantially across vendors, that model scale does not predict adversarial robustness, and that thinking mode provides a deployable safety enhancement. Collectively, this work establishes that current alignment techniques remain fundamentally vulnerable to adaptive multi-turn attacks regardless of model scale, while identifying deliberative inference as a promising defense direction.
