Analyzing Chain-of-Thought Prompting in Large Language Models via Gradient-based Feature Attributions
Skyler Wu, Eric Meng Shen, Charumathi Badrinath, Jiaqi Ma, Himabindu Lakkaraju
TL;DR
The paper investigates why chain-of-thought (CoT) prompting improves LLM reasoning by applying gradient-based token saliency analyses to open-source models across four QA datasets. It compares standard few-shot prompts with CoT prompts, using four saliency variants and multiple models, to reveal differences in which input tokens the models focus on. The key finding is that, at moderate model scales, CoT does not increase the saliency magnitudes of semantically relevant tokens, but it does make saliency scores more robust to question rewordings and to stochasticity in model outputs; accuracy gains are limited except on SST. The work offers a practical interpretability framework for probing CoT behavior and motivates follow-up studies on larger models and broader datasets.
Abstract
Chain-of-thought (CoT) prompting has been shown to empirically improve the accuracy of large language models (LLMs) on various question answering tasks. Understanding why CoT prompting is effective is crucial to ensuring that this improvement is a consequence of desired model behavior, and it is a critical prerequisite for responsible model deployment, yet little work has addressed the question. We address it by leveraging gradient-based feature attribution methods, which produce saliency scores that capture the influence of input tokens on model output. Specifically, we probe several open-source LLMs to investigate whether CoT prompting affects the relative importance they assign to particular input tokens. Our results indicate that while CoT prompting does not increase the magnitude of saliency scores attributed to semantically relevant tokens in the prompt compared to standard few-shot prompting, it increases the robustness of saliency scores to question perturbations and variations in model output.
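
To make the notion of a gradient-based saliency score concrete, the following is a minimal sketch of one common variant, input-times-gradient, computed on a toy model. It is an illustration under stated assumptions, not the paper's implementation: the toy model (mean pooling, tanh, linear readout), the embedding values, the weights `w`, and the function names `logit`, `input_x_gradient`, and `normalize` are all hypothetical, and the gradient is derived by hand rather than obtained by backpropagation through an LLM.

```python
import math


def logit(embeddings, w):
    """Hypothetical toy model: mean-pool token embeddings, apply tanh, project to a scalar."""
    n, dim = len(embeddings), len(w)
    pooled = [sum(e[d] for e in embeddings) / n for d in range(dim)]
    return sum(w[d] * math.tanh(pooled[d]) for d in range(dim))


def input_x_gradient(embeddings, w):
    """Return one unnormalized saliency score per token: |grad . e_i|."""
    n, dim = len(embeddings), len(w)
    pooled = [sum(e[d] for e in embeddings) / n for d in range(dim)]
    # For this toy model, d(logit)/d(e_i[d]) = w[d] * (1 - tanh(pooled[d])^2) / n.
    # The gradient is the same for every token i; the scores differ through e_i.
    grad = [w[d] * (1.0 - math.tanh(pooled[d]) ** 2) / n for d in range(dim)]
    return [abs(sum(g * x for g, x in zip(grad, e))) for e in embeddings]


def normalize(scores):
    """Rescale scores to sum to 1 so tokens can be compared by relative importance."""
    total = sum(scores)
    return [s / total for s in scores] if total > 0 else list(scores)


# Illustrative (made-up) tokens, embeddings, and readout weights.
tokens = ["the", "movie", "was", "great"]
embeddings = [[0.1, 0.0], [0.3, -0.2], [0.0, 0.1], [0.9, 0.8]]
w = [1.0, 0.5]

saliency = normalize(input_x_gradient(embeddings, w))
for tok, score in zip(tokens, saliency):
    print(f"{tok:>6}: {score:.3f}")
```

With a real LLM, the same quantity would be obtained by backpropagating the output logit to the input token embeddings. Comparing the resulting per-token scores between a standard few-shot prompt and a CoT prompt, or between a question and a reworded version of it, is the kind of analysis the abstract describes.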
