Extending Token Computation for LLM Reasoning

Bingli Liao, Danilo Vasconcellos Vargas

TL;DR

The paper addresses inefficient attention in LLM reasoning by studying how extending token computation in the Chain-of-Thought (CoT) process affects performance, and by identifying attention skew toward non-semantic tokens after domain-specific fine-tuning. It introduces a top-layer-guided, training-free attention optimization that emulates early-layer patterns across downstream layers, formalized as $A_l(i, j) = A_l(i, j) + A_l(i, j) \cdot \left(1 - \frac{l}{h}\right) \cdot M_t(i, j)$ for all $i, j \notin D$, to rebalance attention and improve knowledge abstraction. Evaluations on LLaMA-2 with MMLU, PIQA, and SIQA show that larger models (13B) gain in non-STEM reasoning and related tasks, while memory-dependent and STEM domains present trade-offs, suggesting that extended CoT can enhance reasoning when managed carefully. Overall, the work advances understanding of internal LLM dynamics and offers a practical, training-free method to improve cross-domain reasoning, with implications for designing more capable and interpretable LLMs.
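The update rule quoted in the TL;DR can be illustrated with a minimal NumPy sketch. This summary does not define the symbols, so the reading below is an assumption rather than the authors' implementation: `layer` stands for $l$, `num_layers` for $h$, `M_t` for a guidance matrix of the same shape as the attention matrix, and `excluded` for the index set $D$. The name `rebalance_attention` is hypothetical, and the sketch does not renormalize the rows afterwards, since the summary does not say whether the method does.

```python
import numpy as np


def rebalance_attention(A_l, M_t, layer, num_layers, excluded):
    """Apply A_l + A_l * (1 - l/h) * M_t to entries whose row and column are not in D.

    Assumptions (not confirmed by the source): `layer` is l, `num_layers` is h,
    `M_t` is a guidance matrix shaped like `A_l`, and `excluded` is the set D.
    """
    A_l = np.asarray(A_l, dtype=float)
    M_t = np.asarray(M_t, dtype=float)
    n = A_l.shape[0]

    # Boolean vector that is True for token positions outside D.
    keep = np.ones(n, dtype=bool)
    keep[np.array(sorted(excluded), dtype=int)] = False

    # An entry (i, j) is updated only when both i and j are outside D.
    mask = np.outer(keep, keep)

    # The factor (1 - l/h) shrinks the adjustment linearly with depth.
    scale = 1.0 - layer / num_layers
    updated = A_l + A_l * scale * M_t

    return np.where(mask, updated, A_l)


# Toy usage: a 2x2 attention matrix with token 1 treated as a member of D.
A = np.array([[0.2, 0.8], [0.5, 0.5]])
M = np.ones_like(A)
print(rebalance_attention(A, M, layer=1, num_layers=4, excluded={1}))
```

Under this reading, the adjustment is strongest in early layers and vanishes when `layer` equals `num_layers`, while entries that touch an index in `excluded` pass through unchanged.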

Abstract

Large Language Models (LLMs) are pivotal in advancing natural language processing but often struggle with complex reasoning tasks due to inefficient attention distributions. In this paper, we explore the effect of increased computed tokens on LLM performance and introduce a novel method for extending computed tokens in the Chain-of-Thought (CoT) process, utilizing attention mechanism optimization. By fine-tuning an LLM on a domain-specific, highly structured dataset, we analyze attention patterns across layers, identifying inefficiencies caused by non-semantic tokens with outlier high attention scores. To address this, we propose an algorithm that emulates early layer attention patterns across downstream layers to re-balance skewed attention distributions and enhance knowledge abstraction. Our findings demonstrate that our approach not only facilitates a deeper understanding of the internal dynamics of LLMs but also significantly improves their reasoning capabilities, particularly in non-STEM domains. Our study lays the groundwork for further innovations in LLM design, aiming to create more powerful, versatile, and responsible models capable of tackling a broad range of real-world applications.

Paper Structure

This paper contains 14 sections, 1 equation, 10 figures, and 1 table.

Figures (10)

  • Figure 1: Illustration of the big picture of our method.
  • Figure 2: Visualization of attention score matrices from the fine-tuned language model. (Left) Attention score matrix from layer 6 of the fine-tuned language model, providing an overview of the attention patterns learned at this depth. (Right) Zoomed-in view of attention scores from the first and middle layers of the model, offering a more detailed view of the attention dynamics within these layers.
  • Figure 3: Illustration of attention score matrix modifications in preliminary tests on the fine-tuned language model. The attention score matrix is segmented into the prompt and dialogue parts. (Left) The original attention score matrix, with dark blue cells representing anchor tokens. (Middle) Modified attention score matrix with anchor tokens removed from the prompt-related token range. (Right) Counter-experiment where only anchor tokens are retained in the prompt-related token range.
  • Figure 4: Illustration of the experiment in which prompt-range attention scores were removed from alternating middle layers (4 to 8).
  • Figure 5: Comparative analysis of attention score matrices from the original fine-tuned LLM and the model fine-tuned with dropout. (Left) Attention score matrix from the original fine-tuned model. (Middle) Attention score matrix from the model fine-tuned with dropout. (Right) Matrix depicting the difference in attention scores between the original and dropout-regularized models, highlighting the impact of dropout on the learned attention patterns.
  • ...and 5 more figures