Consistency of Large Reasoning Models Under Multi-Turn Attacks

Yubo Li, Ramayya Krishnan, Rema Padman

TL;DR

This work probes the robustness of large reasoning models under multi-turn adversarial pressure and finds that reasoning yields meaningful but incomplete robustness. Using MT-Consistency with an 8-round adversarial protocol across nine frontier models, the study conducts trajectory analyses and identifies five failure modes, with Self-Doubt and Social Conformity together accounting for about half of all flips. It also shows that Confidence-Aware Response Generation (CARG) is ineffective for reasoning models because extended reasoning traces induce overconfidence, and that random confidence embedding unexpectedly outperforms targeted confidence extraction, highlighting the need for new, reasoning-aware robustness strategies. The findings imply that deploying reasoning systems in high-stakes settings requires defense paradigms beyond straightforward confidence-based interventions, along with careful consideration of model-specific vulnerability profiles. The work provides actionable insights for designing robust, trustworthy reasoning systems in adversarial conversational settings.

Abstract

Large reasoning models achieve state-of-the-art performance on complex tasks, but their robustness under multi-turn adversarial pressure remains underexplored. We evaluate nine frontier reasoning models under adversarial attacks. Our findings reveal that reasoning confers meaningful but incomplete robustness: most reasoning models studied significantly outperform instruction-tuned baselines, yet all exhibit distinct vulnerability profiles, with misleading suggestions universally effective and social pressure showing model-specific efficacy. Through trajectory analysis, we identify five failure modes (Self-Doubt, Social Conformity, Suggestion Hijacking, Emotional Susceptibility, and Reasoning Fatigue), with the first two accounting for 50% of failures. We further demonstrate that Confidence-Aware Response Generation (CARG), effective for standard LLMs, fails for reasoning models due to overconfidence induced by extended reasoning traces; counterintuitively, random confidence embedding outperforms targeted extraction. Our results highlight that reasoning capabilities do not automatically confer adversarial robustness and that confidence-based defenses require fundamental redesign for reasoning models.

Paper Structure (39 sections, 3 equations, 4 figures, 10 tables)

Figures (4)

  • Figure 1: Vulnerability profiles across attack types for each model. Each axis represents one attack type (A1–A8); distance from center indicates flip rate when that attack is applied. Larger, more irregular polygons indicate higher overall vulnerability with distinct weak points.
  • Figure 2: Round-by-round accuracy comparison across models with and without CARG. Unlike standard LLMs, where CARG maintains stable performance, large reasoning models show no benefit, and in some cases degradation, from confidence-aware generation.
  • Figure 3: Initial accuracy (Round 0) of language models by question difficulty level (left) and subject cluster (right). The left panel shows performance stratified by difficulty, with high school questions yielding the highest mean accuracy (94.3%) and college-level questions the lowest (86.8%). The right panel reveals domain-specific strengths: Humanities achieves uniformly high accuracy across models, while STEM and Social Sciences exhibit greater inter-model variance.
  • Figure 4: Average accuracy across adversarial rounds (Rounds 1–8) by question difficulty level (left) and subject cluster (right). High school questions still maintain the highest mean accuracy (90.9%), while elementary questions drop to third place (84.8%), suggesting these foundational questions are more susceptible to adversarial pressure. The right panel shows Humanities and Medical/Health domains maintain relatively stable performance under adversarial conditions, while Social Sciences and Law/Legal exhibit greater vulnerability to opinion manipulation.
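
The captions for Figures 1 and 2 refer to two quantities: a per-attack flip rate and round-by-round accuracy. The sketch below shows one way these could be computed from recorded trajectories. It is illustrative only and is not the paper's implementation: the `Turn` and `Trajectory` types and both function names are hypothetical, and the flip-rate definition used here (the share of turns entered with a correct answer that end with an incorrect one) is an assumption, since the captions do not spell out the exact formula.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Turn:
    attack: str    # hypothetical attack label, e.g. "A1" .. "A8"
    correct: bool  # is the model's answer correct after this adversarial turn?


@dataclass
class Trajectory:
    initial_correct: bool  # correctness at Round 0, before any attack
    turns: List[Turn]      # adversarial rounds, in order


def flip_rate_by_attack(trajectories: List[Trajectory]) -> Dict[str, float]:
    """Assumed definition: among turns that start from a correct answer,
    the fraction that end with an incorrect answer, grouped by attack."""
    exposed = defaultdict(int)
    flipped = defaultdict(int)
    for traj in trajectories:
        prev_correct = traj.initial_correct
        for turn in traj.turns:
            if prev_correct:
                exposed[turn.attack] += 1
                if not turn.correct:
                    flipped[turn.attack] += 1
            prev_correct = turn.correct
    return {attack: flipped[attack] / exposed[attack] for attack in exposed}


def accuracy_by_round(trajectories: List[Trajectory]) -> List[float]:
    """Index 0 is Round 0 accuracy; index k is accuracy after the k-th turn."""
    if not trajectories:
        return []
    total = len(trajectories)
    n_rounds = min(len(t.turns) for t in trajectories)
    acc = [sum(t.initial_correct for t in trajectories) / total]
    for k in range(n_rounds):
        acc.append(sum(t.turns[k].correct for t in trajectories) / total)
    return acc


# Toy data for illustration only; these are not results from the paper.
toy = [
    Trajectory(True, [Turn("A1", True), Turn("A2", False)]),
    Trajectory(True, [Turn("A1", False), Turn("A2", False)]),
    Trajectory(False, [Turn("A1", False), Turn("A2", True)]),
]
toy_flip_rates = flip_rate_by_attack(toy)
toy_accuracy = accuracy_by_round(toy)
```

Under this assumed definition, attacks that are never applied to a correct answer are left out of the result instead of being reported as zero, which avoids understating vulnerability for rarely exposed attack types.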