Analyzing Chain-of-Thought Prompting in Large Language Models via Gradient-based Feature Attributions

Skyler Wu; Eric Meng Shen; Charumathi Badrinath; Jiaqi Ma; Himabindu Lakkaraju

TL;DR

The paper asks why chain-of-thought (CoT) prompting improves LLM reasoning, applying gradient-based token saliency analyses to open-source models across four QA datasets. It systematically compares standard and CoT prompts, using four saliency variants and multiple models, to reveal differences in which tokens the models focus on. The key finding is that, at moderate model scales, CoT does not increase the saliency magnitudes of semantically relevant tokens, but it does make saliency scores more robust to question rewordings and to stochasticity in model outputs; accuracy gains are limited except on SST. The work demonstrates a practical interpretability framework for probing CoT behavior and motivates scaling studies on larger models and broader datasets.

Abstract

Chain-of-thought (CoT) prompting has been shown to empirically improve the accuracy of large language models (LLMs) on various question answering tasks. While understanding why CoT prompting is effective is crucial to ensuring that this phenomenon is a consequence of desired model behavior, little work has addressed this; nonetheless, such an understanding is a critical prerequisite for responsible model deployment. We address this question by leveraging gradient-based feature attribution methods which produce saliency scores that capture the influence of input tokens on model output. Specifically, we probe several open-source LLMs to investigate whether CoT prompting affects the relative importances they assign to particular input tokens. Our results indicate that while CoT prompting does not increase the magnitude of saliency scores attributed to semantically relevant tokens in the prompt compared to standard few-shot prompting, it increases the robustness of saliency scores to question perturbations and variations in model output.
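
To make the notion of a gradient-based saliency score concrete, the following is a minimal sketch, not the authors' implementation. It assumes a hypothetical toy model (mean-pooled tanh features followed by a linear layer) in place of an LLM, so that gradients can be written analytically in NumPy. The names `toy_logits`, `logit_gradients`, and `saliency` are illustrative, and treating the contrastive variant as the gradient of the logit difference between the predicted answer and a foil answer is an assumption.

```python
import numpy as np


def toy_logits(E, U, W):
    """Toy stand-in for a model: logits = W @ mean_t tanh(E[t] @ U).

    E: (T, d) token embeddings, U: (d, h) projection, W: (C, h) output layer.
    """
    return W @ np.tanh(E @ U).mean(axis=0)


def logit_gradients(E, U, W, k):
    """Analytic gradient of logit k with respect to every token embedding."""
    H = np.tanh(E @ U)  # (T, h) hidden features per token
    # Chain rule: d logit_k / d E[t] = U @ ((1 - H[t]^2) * W[k]) / T
    return ((1.0 - H ** 2) * W[k]) @ U.T / E.shape[0]  # (T, d)


def saliency(E, U, W, target, foil=None, method="grad_x_input"):
    """Per-token saliency scores.

    method="grad_x_input": signed score, sum_i grad[t, i] * E[t, i].
    method="l1":           unsigned score, sum_i |grad[t, i]|.
    If `foil` is given, the gradient of (logit[target] - logit[foil]) is
    used instead (assumed form of the contrastive variant).
    """
    grads = logit_gradients(E, U, W, target)
    if foil is not None:
        grads = grads - logit_gradients(E, U, W, foil)
    if method == "grad_x_input":
        return (grads * E).sum(axis=1)
    if method == "l1":
        return np.abs(grads).sum(axis=1)
    raise ValueError(f"unknown method: {method}")


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    E = rng.normal(size=(6, 4))   # 6 tokens, embedding size 4 (arbitrary)
    U = rng.normal(size=(4, 3))
    W = rng.normal(size=(2, 3))   # 2 candidate answers
    pred = int(np.argmax(toy_logits(E, U, W)))
    for method in ("grad_x_input", "l1"):
        print(method, saliency(E, U, W, pred, foil=1 - pred, method=method))
```

With a real language model the gradients would come from automatic differentiation rather than a closed form, but the two reductions over the embedding dimension stay the same.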

Paper Structure (51 sections, 35 figures, 21 tables)

This paper contains 51 sections, 35 figures, 21 tables.

Introduction
Related Work
Our Framework
Analyzing the Impact of CoT
Experimental Results
Impact of CoT Prompting on Model Accuracy
Impact of CoT Prompting on Relevant Tokens' Saliencies
Impact of CoT Prompting on Robustness to Question Rewordings
Impact of CoT Prompting on Saliency Score Stability Across Variation in Model Outputs
Conclusion
Appendix
Additional Details on Saliency Methods
Additional Details on Question Selection
Overall Model Accuracies and Error Breakdowns
All Results and Plots from Experiment 1
...and 36 more sections

Figures (35)

Figure 1: Mean absolute saliency scores of relevant tokens on all four original datasets using GPT-J, with and without CoT prompting.
Figure 2: GPT-J mean absolute saliency scores of relevant tokens on GSM8K (Original) and (Reworded), with / without CoT.
Figure 3: Plots of saliency scores of manually-labeled relevant tokens of a selected question from the GSM8K (Original) dataset over 20 varied text generation runs using GPT-J. When a relevant token occurred multiple times in the question, the score with the highest magnitude was chosen. Scatter points with the same color correspond to runs that outputted the same answer.
Figure 4: Saliency maps illustrating the saliency scores of input tokens for one question from the SST dataset in influencing the final answer token (either "positive" or "negative"), using GPT-J. Panels A-D show maps produced with CoT prompting, while Panels E-H show maps produced without CoT prompting. Panels A and E use the contrastive gradient x input method, Panels B and F use the non-contrastive gradient x input method, Panels C and G use the contrastive L1 norm method, and Panels D and H use the non-contrastive L1 norm method. Contrastive methods compare the output against the other possible answer. Coloring of the scores is normalized such that white corresponds to the average saliency score; for gradient x input, a bluer/redder color denotes a more negative/positive influence toward the model's output, while for L1 norm, a bluer/redder color denotes a smaller/larger influence toward the output. With standard prompting, the tokens immediately preceding the answer have the strongest saliency scores, whereas with CoT prompting, although the numerical magnitudes of all saliency scores decrease, earlier tokens are relatively more significant. Under L1 norm methods with CoT, the word "positively" also has special significance in influencing the answer, suggesting that the model is paying attention to more relevant parts of the existing text when outputting the final answer.
Figure 5: Distributions of magnitudes of saliency scores for manually-labeled relevant tokens in questions from different question-answering datasets (SST, CoinFlip (Original), CSQA, GSM8K (Original)) using GPT-Neo, with and without CoT prompting. Especially for the CoinFlip (Original) and GSM8K (Original) datasets, CoT prompting decreases not only the magnitudes of relevant tokens' saliency scores, but also their variances, suggesting that for those question-answering tasks the model is more "stable" or consistent when assigning importance to relevant tokens while generating the final answer.
...and 30 more figures
