Misaligning Reasoning with Answers -- A Framework for Assessing LLM CoT Robustness
Enyi Jiang, Changming Xu, Nischay Singh, Gagandeep Singh
TL;DR
The paper introduces MATCHA, a framework for assessing misalignment between an LLM's final answer and its Chain-of-Thought (CoT) reasoning under carefully crafted input perturbations. It develops two perturbation modalities, token-level edits and embedding-level perturbations, and uses LLM judges to quantify reasoning correctness independently of the answer. The approach is formalized via losses such as $L_{c}$, $L_{a}$, and $L_{opt}=L_{c}-\lambda L_{a}$. Empirical results across math and commonsense benchmarks show that CoT reasoning is fragile, with higher vulnerability in multi-step tasks and notable transferability to black-box models like GPT-3.5-turbo and GPT-4o. The work demonstrates that evaluating CoT robustness requires considering answers and reasoning jointly, and it provides a framework to guide the development of more robust, reasoning-driven architectures.
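As a minimal sketch of the combined objective $L_{opt}=L_{c}-\lambda L_{a}$, the snippet below only evaluates the formula for given scalar values. It assumes $L_{c}$ and $L_{a}$ have already been computed elsewhere; what each loss measures and how it is obtained are defined in the paper and are not reproduced here. The names `combined_objective`, `loss_c`, `loss_a`, and `lam` are hypothetical and chosen for illustration.

```python
def combined_objective(loss_c: float, loss_a: float, lam: float) -> float:
    """Evaluate L_opt = L_c - lambda * L_a for scalar inputs.

    Assumption: loss_c and loss_a are precomputed scalar losses; this
    function makes no claim about how the paper computes them.
    """
    return loss_c - lam * loss_a


# Illustrative values only (not taken from the paper).
l_opt = combined_objective(loss_c=2.0, loss_a=0.5, lam=1.0)
print(l_opt)
```

A larger $\lambda$ gives the $L_{a}$ term more weight relative to $L_{c}$, and setting $\lambda = 0$ reduces the objective to $L_{c}$ alone.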
Abstract
LLMs' decision-making process is opaque, prompting the need for explanation techniques like Chain-of-Thought (CoT). To investigate the relationship between answer and reasoning, we design a novel evaluation framework, MATCHA. In domains like education and healthcare, reasoning is key for model trustworthiness. MATCHA reveals that LLMs under input perturbations can give inconsistent or nonsensical reasoning. Additionally, we use LLM judges to assess reasoning robustness across models. Our results show that LLMs exhibit greater vulnerability to input perturbations for multi-step and commonsense tasks than for logical tasks. We also show non-trivial transfer rates of our successful examples to black-box models. Our evaluation framework helps to better understand LLM reasoning mechanisms and guides future models toward more robust and reasoning-driven architectures, enforcing answer-reasoning consistency.
