Fragile Thoughts: How Large Language Models Handle Chain-of-Thought Perturbations

Ashwath Vaithinathan Aravindan, Mayank Kejriwal

TL;DR

LLMs show heterogeneous vulnerability to chain-of-thought perturbations: model size protects against some perturbation types but offers limited defense on dimensional reasoning tasks, and the scaling relationships follow power-law patterns.

Abstract

Chain-of-Thought (CoT) prompting has emerged as a foundational technique for eliciting reasoning from Large Language Models (LLMs), yet the robustness of this approach to corruptions in intermediate reasoning steps remains poorly understood. This paper presents a comprehensive empirical evaluation of LLM robustness to a structured taxonomy of 5 CoT perturbation types: MathError, UnitConversion, Sycophancy, SkippedSteps, and ExtraSteps. We evaluate 13 models spanning three orders of magnitude in parameter count (3B to 1.5T; parameter counts of closed models are assumed), testing their ability to complete mathematical reasoning tasks despite perturbations injected at different points in the reasoning chain. Our key findings reveal heterogeneous vulnerability patterns: MathError perturbations produce the most severe degradation in small models (50-60% accuracy loss) but show strong scaling benefits; UnitConversion remains challenging across all scales (20-30% loss even for the largest models); ExtraSteps incur minimal accuracy degradation (0-6%) regardless of scale; Sycophancy produces modest effects (7% loss for small models); and SkippedSteps cause intermediate damage (15% loss). Scaling relationships follow power-law patterns, with model size serving as a protective factor against some perturbations but offering limited defense against dimensional reasoning tasks. These findings have direct implications for deploying LLMs in multi-stage reasoning pipelines and underscore the necessity of task-specific robustness assessments and mitigation strategies. The code and results are available at https://github.com/Mystic-Slice/CoTPerturbation.

Paper Structure (29 sections, 1 equation, 1 figure, 6 tables)

Figures (1)

  • Figure 1: Accuracy drop from each perturbation type plotted against model size. Plots use a logarithmic scale for model size on the x-axis (log10 of billions of parameters), visualizing the relationship between model capacity and robustness across MathError, ExtraSteps, UnitConversion, SkippedSteps, and Sycophancy perturbations. Larger models generally exhibit greater robustness, though the strength and character of this relationship vary across perturbation types.
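
The power-law scaling mentioned in the abstract can be illustrated with a small fitting sketch. It assumes a functional form drop = a * size^(-b), which this summary does not state explicitly, and fits it by ordinary least squares in log-log space. The function name `fit_power_law` and the data points are hypothetical: the points are synthetic, generated from a known curve, and are not results from the paper.

```python
import math


def fit_power_law(sizes_b, drops):
    """Fit drop = a * size**(-b) by least squares on log10-transformed data.

    sizes_b: model sizes in billions of parameters (must be > 0)
    drops:   accuracy drops in percentage points (must be > 0)
    Returns the pair (a, b).
    """
    xs = [math.log10(s) for s in sizes_b]
    ys = [math.log10(d) for d in drops]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx                    # equals -b in log-log space
    intercept = mean_y - slope * mean_x  # equals log10(a)
    return 10 ** intercept, -slope


# Synthetic placeholder data (NOT from the paper), generated from
# drop = 60 * size**(-0.3) so the fit can be checked against known values.
sizes_b = [3, 7, 70, 400, 1500]
drops = [60 * s ** -0.3 for s in sizes_b]

a, b = fit_power_law(sizes_b, drops)
print(f"a = {a:.2f}, b = {b:.2f}")
```

A fit of this kind only works for strictly positive drops; a perturbation with a 0% drop at some scale would need an offset or a different model before taking logarithms.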