Misaligning Reasoning with Answers -- A Framework for Assessing LLM CoT Robustness

Enyi Jiang, Changming Xu, Nischay Singh, Gagandeep Singh

TL;DR

The paper introduces MATCHA, a framework for assessing misalignment between an LLM's final answer and its Chain-of-Thought (CoT) reasoning under carefully crafted input perturbations. It develops two perturbation modalities (token-level edits and embedding-level perturbations) and uses LLM judges to quantify reasoning correctness independently of the answer, formalized via losses such as $L_{c}$, $L_{a}$, and $L_{opt}=L_{c}-\lambda L_{a}$. Empirical results across math and commonsense benchmarks show that CoT reasoning is fragile, with higher vulnerability in multi-step tasks and notable transferability to black-box models such as GPT-3.5-turbo and GPT-4o. The work demonstrates that evaluating CoT robustness requires considering answers and reasoning jointly, and it provides a framework to guide the development of more robust, reasoning-driven architectures.
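To make the combined objective $L_{opt}=L_{c}-\lambda L_{a}$ concrete, here is a minimal Python sketch. The TL;DR gives only the symbols, so reading $L_{c}$ as a loss on the reasoning tokens and $L_{a}$ as a loss on the answer tokens is an assumption, and the function and parameter names are hypothetical rather than taken from the paper.

```python
def combined_objective(loss_cot: float, loss_answer: float, lam: float) -> float:
    """Return L_opt = L_c - lambda * L_a.

    Assumption (not stated in the summary): loss_cot is a loss measured on the
    reasoning tokens and loss_answer a loss measured on the final-answer tokens.
    lam is the weight lambda that trades one term off against the other.
    """
    return loss_cot - lam * loss_answer


# Illustrative numbers only, not results from the paper.
value = combined_objective(loss_cot=2.0, loss_answer=0.5, lam=1.0)
print(value)
```

Under this reading, a larger `lam` makes the answer term weigh more heavily against the reasoning term; how the paper actually sets $\lambda$ is not given in this summary.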

Abstract

LLMs' decision-making process is opaque, prompting the need for explanation techniques like Chain-of-Thought. To investigate the relationship between answer and reasoning, we design a novel evaluation framework, MATCHA. In domains like education and healthcare, reasoning is key for model trustworthiness. MATCHA reveals that LLMs under input perturbations can give inconsistent or nonsensical reasoning. Additionally, we use LLM judges to assess reasoning robustness across models. Our results show that LLMs are more vulnerable to input perturbations on multi-step and commonsense tasks than on logical tasks. We also show non-trivial transfer rates of our successful examples to black-box models. Our evaluation framework helps to better understand LLM reasoning mechanisms and guides future models toward more robust and reasoning-driven architectures that enforce answer-reasoning consistency.
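One way to picture the kind of misalignment the abstract describes is to count perturbed examples where the final answer stays correct while a judge marks the reasoning as wrong. The sketch below is illustrative only: the `JudgedExample` record and the `misalignment_rate` function are hypothetical names, and the paper's actual metric and judging protocol are not specified in this summary.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class JudgedExample:
    """Hypothetical record for one perturbed input."""
    answer_correct: bool      # final answer still matches the reference
    reasoning_correct: bool   # verdict from an (assumed) LLM judge on the CoT


def misalignment_rate(examples: Sequence[JudgedExample]) -> float:
    """Fraction of examples with a correct answer but incorrect reasoning."""
    if not examples:
        return 0.0
    misaligned = sum(
        1 for e in examples if e.answer_correct and not e.reasoning_correct
    )
    return misaligned / len(examples)


# Toy verdicts, not data from the paper.
toy = [
    JudgedExample(answer_correct=True, reasoning_correct=True),
    JudgedExample(answer_correct=True, reasoning_correct=False),
    JudgedExample(answer_correct=False, reasoning_correct=False),
    JudgedExample(answer_correct=True, reasoning_correct=True),
]
print(misalignment_rate(toy))
```

Counting only the correct-answer, wrong-reasoning case keeps the measure focused on answer-reasoning inconsistency rather than on ordinary accuracy drops.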

Paper Structure

This paper contains 25 sections, 9 equations, 5 figures, 12 tables.

Figures (5)

  • Figure 1: Perturbations in the input question can make reasoning wrong while preserving the correct answer, indicating an underlying problem with answer-reasoning alignment. Example shown using token-level MATCHA applied to DeepSeek-R1-7B on GSM8k.
  • Figure 2: Transferability experiments on closed-source models (GPT-3.5-turbo and GPT-4o) using the successful token-level examples, showing non-trivial transfer rates to these models.
  • Figure 3: Successful examples of our token-level and embedding-level perturbations on different models. We classify the errors into four categories. For token-level perturbations, the replaced tokens are colored in red; in the CoTs, the wrong steps are colored in red.
  • Figure 4: Ablation studies on the number of perturbation steps and inserted token ratio.
  • Figure 5: Ablation studies on perturbation percentage $\epsilon$ of embedding-level attacks using Llama-3-8B on datasets.