Chain-of-Thought Driven Adversarial Scenario Extrapolation for Robust Language Models
Md Rafi Ur Rashid, Vishnu Asutosh Dasu, Ye Wang, Gang Tan, Shagufta Mehnaz
TL;DR
The paper presents Adversarial Scenario Extrapolation (ASE), a chain-of-thought-driven inference-time defense for large language models that reasons about potential adversarial scenarios before responding. ASE adds three steps (Adversarial Scenario Generation, Defensive Strategy Formulation, and Guarded Response Generation) and achieves near-zero jailbreak success, sharply reduces toxicity, and minimizes outright rejections while preserving usefulness on tasks such as MMLU and CNN/DailyMail. ASE transfers across threat types and outperforms six state-of-the-art defenses in robustness-seamlessness trade-offs, and a later Two-Step ASE variant substantially reduces latency. The approach combines robustness with natural, context-aware interaction and can be deployed at inference time without offline fine-tuning.
Abstract
Large Language Models (LLMs) exhibit impressive capabilities but remain susceptible to a growing spectrum of safety risks, including jailbreaks, toxic content, hallucinations, and bias. Existing defenses often address only a single threat type or resort to rigid outright rejection, sacrificing user experience and failing to generalize across diverse and novel attacks. This paper introduces Adversarial Scenario Extrapolation (ASE), a novel inference-time computation framework that leverages Chain-of-Thought (CoT) reasoning to simultaneously enhance LLM robustness and seamlessness. ASE guides the LLM through a self-generative process of contemplating potential adversarial scenarios and formulating defensive strategies before generating a response to the user query. Comprehensive evaluation on four adversarial benchmarks with four recent LLMs shows that ASE achieves near-zero jailbreak attack success rates and minimal toxicity, while reducing outright rejections to below 4%. ASE outperforms six state-of-the-art defenses in robustness-seamlessness trade-offs, with 92-99% accuracy on adversarial Q&A and 4-10x lower bias scores. By transforming adversarial perception into an intrinsic cognitive process, ASE sets a new paradigm for secure and natural human-AI interaction.
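To make the control flow concrete, the three ASE steps named in the TL;DR can be sketched as three chained model calls. This is a minimal illustration, not the authors' implementation: the prompt templates, the function name `ase_respond`, and the `generate` callable (any function that maps a prompt string to a completion string) are all assumptions introduced here, and the paper's actual prompts are not reproduced in this text.

```python
from typing import Callable

# Hypothetical prompt templates, written for illustration only.
SCENARIO_PROMPT = (
    "Before answering, list ways the following query could be adversarial "
    "(for example jailbreak, toxicity, hallucination, or bias):\n{query}"
)
STRATEGY_PROMPT = (
    "Query:\n{query}\n\nPossible adversarial scenarios:\n{scenarios}\n\n"
    "Describe how to respond helpfully while guarding against these scenarios."
)
RESPONSE_PROMPT = (
    "Query:\n{query}\n\nDefensive strategy:\n{strategy}\n\n"
    "Now answer the query, following the strategy and avoiding a flat refusal "
    "where a safe, useful answer exists."
)


def ase_respond(query: str, generate: Callable[[str], str]) -> str:
    """Run a three-step ASE-style pipeline with a caller-supplied model.

    `generate` is an assumed interface: prompt string in, completion string out.
    """
    # Step 1: Adversarial Scenario Generation.
    scenarios = generate(SCENARIO_PROMPT.format(query=query))
    # Step 2: Defensive Strategy Formulation, conditioned on the scenarios.
    strategy = generate(
        STRATEGY_PROMPT.format(query=query, scenarios=scenarios)
    )
    # Step 3: Guarded Response Generation, conditioned on the strategy.
    return generate(RESPONSE_PROMPT.format(query=query, strategy=strategy))


if __name__ == "__main__":
    # Stand-in model so the sketch runs without any external service.
    def stub_model(prompt: str) -> str:
        return f"[completion for a {len(prompt)}-character prompt]"

    print(ase_respond("Summarize this news article.", stub_model))
```

Passing the model as a callable keeps the sketch independent of any particular LLM API. Each step costs one extra model call, which is consistent with the TL;DR's note that a Two-Step variant reduces latency, although how that variant merges steps is not specified here.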
