Chain-of-Thought Driven Adversarial Scenario Extrapolation for Robust Language Models

Md Rafi Ur Rashid, Vishnu Asutosh Dasu, Ye Wang, Gang Tan, Shagufta Mehnaz

TL;DR

The paper presents Adversarial Scenario Extrapolation (ASE), a chain-of-thought-driven inference-time defense for large language models that proactively reasons about potential adversarial scenarios before responding. ASE adds three steps before the model answers: Adversarial Scenario Generation, Defensive Strategy Formulation, and Guarded Response Generation. With these steps, ASE achieves near-zero jailbreak success and minimal toxicity, and keeps outright rejections below 4%, while preserving usefulness on tasks like MMLU and CNN/DailyMail. ASE transfers across threat types and outperforms six state-of-the-art defenses in robustness-seamlessness trade-offs, and a later Two-Step ASE variant substantially reduces latency. The approach combines robustness with natural, context-aware interaction, and can be deployed at inference time without offline fine-tuning.

Abstract

Large Language Models (LLMs) exhibit impressive capabilities, but remain susceptible to a growing spectrum of safety risks, including jailbreaks, toxic content, hallucinations, and bias. Existing defenses often address only a single threat type or resort to rigid outright rejection, sacrificing user experience and failing to generalize across diverse and novel attacks. This paper introduces Adversarial Scenario Extrapolation (ASE), a novel inference-time computation framework that leverages Chain-of-Thought (CoT) reasoning to simultaneously enhance LLM robustness and seamlessness. ASE guides the LLM through a self-generative process of contemplating potential adversarial scenarios and formulating defensive strategies before generating a response to the user query. Comprehensive evaluation on four adversarial benchmarks with four latest LLMs shows that ASE achieves near-zero jailbreak attack success rates and minimal toxicity, while slashing outright rejections to <4%. ASE outperforms six state-of-the-art defenses in robustness-seamlessness trade-offs, with 92-99% accuracy on adversarial Q&A and 4-10x lower bias scores. By transforming adversarial perception into an intrinsic cognitive process, ASE sets a new paradigm for secure and natural human-AI interaction.
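The three-step process described above can be sketched as a chain of prompts. This is a minimal illustration, not the authors' implementation: the function name `ase_respond`, the `LLM` callable type, and all prompt wording are assumptions made here for clarity, and the paper's actual prompts may differ.

```python
from typing import Callable, Dict

# Hypothetical interface: any function that maps a prompt string to a completion.
LLM = Callable[[str], str]


def ase_respond(llm: LLM, user_query: str) -> Dict[str, str]:
    """Run an ASE-style three-step chain (illustrative prompts only)."""
    # Step 1: Adversarial Scenario Generation.
    # Ask the model how the query might be part of an adversarial attempt.
    scenarios = llm(
        "Consider the user query below. List ways it could be part of an "
        "adversarial attempt (e.g. jailbreak, toxicity, bias, hallucination).\n"
        f"Query: {user_query}"
    )

    # Step 2: Defensive Strategy Formulation.
    # Ask the model how to stay safe and still helpful given those scenarios.
    strategies = llm(
        "Given these potential adversarial scenarios, describe how to respond "
        "safely while remaining helpful.\n"
        f"Scenarios: {scenarios}"
    )

    # Step 3: Guarded Response Generation.
    # Answer the original query with the strategies in context.
    response = llm(
        "Answer the query while applying the defensive strategies. Prefer a "
        "safe, useful answer over an outright refusal where one exists.\n"
        f"Query: {user_query}\n"
        f"Strategies: {strategies}"
    )

    return {"scenarios": scenarios, "strategies": strategies, "response": response}
```

Any client can be plugged in as `llm`, for example a thin wrapper around an API call or a locally hosted model. Only the `response` field would be shown to the user; the intermediate fields are returned here just to make the chain inspectable.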

Paper Structure

This paper contains 41 sections, 3 equations, 3 figures, 6 tables.

Figures (3)

  • Figure 1: (a) Vanilla and (b) ASE-enhanced LLM responses while dealing with a harmful user query.
  • Figure 2: Inference overhead comparison between API-based and locally hosted Gemma-2-27B on the CNN/DailyMail summarization task: (a) average latency for first-token generation, (b) average end-to-end latency, and (c) average token count in the final response.
  • Figure 3: Comparison among ASE and six state-of-the-art defenses for the LLM's general utility on two utility benchmarks.