Chain-of-Thought in Large Language Models: Decoding, Projection, and Activation

Hao Yang, Qianghua Zhao, Lei Li

TL;DR

This work dissects how Chain-of-Thought prompting affects large language models by examining decoding dynamics, projection-space changes, and neuron activation. Using arithmetic, commonsense, and symbolic reasoning tasks across Gemma and LLaMA2 models with 4-shot prompts, it contrasts CoT with standard prompts through fine- and coarse-grained analyses, transfer tests, and FFN-activation metrics. The findings show that models imitate CoT exemplars while integrating question context, that token logits oscillate during generation but culminate in a more concentrated final distribution, and that CoT expands activation in the final layers, suggesting deeper and broader knowledge retrieval. These insights inform a more nuanced understanding of CoT mechanisms and have implications for prompt design and future research into reasoning in LLMs.

Abstract

Chain-of-Thought prompting has significantly enhanced the reasoning capabilities of large language models, with numerous studies exploring factors influencing its performance. However, the underlying mechanisms remain poorly understood. To further demystify the operational principles, this work examines three key aspects: decoding, projection, and activation, aiming to elucidate the changes that occur within models when employing Chain-of-Thought. Our findings reveal that LLMs effectively imitate exemplar formats while integrating them with their understanding of the question, exhibiting fluctuations in token logits during generation but ultimately producing a more concentrated logits distribution, and activating a broader set of neurons in the final layers, indicating more extensive knowledge retrieval compared to standard prompts. Our code and data will be publicly available when the paper is accepted.

Paper Structure

This paper contains 30 sections, 2 equations, 24 figures, 17 tables.

Figures (24)

  • Figure 1: Statistical analysis of test-point matches in model-generated content when using CoT.
  • Figure 2: Results of the transfer test for Gemma2-27b, comparing test-point overlap between model-generated content and exemplars (upper) or input questions (lower). See Figures \ref{fig:transfer_test_2b}, \ref{fig:transfer_test_9b}, and \ref{fig:transfer_test_13b} for complete results.
  • Figure 3: Number of samples imitating exemplars (left) and also answering correctly (right) for Gemma2-9b. See Figures \ref{fig:hotmap2b}, \ref{fig:hotmap13b}, and \ref{fig:hotmap27b} for complete results.
  • Figure 4: Normalized logits value of each generated token, reported for Gemma2-9b. See Figures \ref{fig:question2_logits_value_2b}, \ref{fig:question2_logits_value_13b}, and \ref{fig:question2_logits_value_27b} for other models and other datasets.
  • Figure 5: Kernel density estimation of normalized logits for " the answer is ..." generated by Gemma2-9b (see Figures \ref{fig:question2_logits_value_kernel_2b}-\ref{fig:question2_logits_value_kernel_27b} for more results).
  • ...and 19 more figures