Unveiling and Harnessing Hidden Attention Sinks: Enhancing Large Language Models without Training through Attention Calibration

Zhongzhi Yu, Zheng Wang, Yonggan Fu, Huihong Shi, Khalid Shaikh, Yingyan Celine Lin

TL;DR

This work systematically reveals that attention sinks are not confined to the initial token but can occur throughout input sequences, and that not all sinks are beneficial. It introduces ACT, a training-free, inference-time calibration method that identifies which attention heads to calibrate and then adjusts attention distributions to suppress non-semantic sinks while preserving meaningful token relationships. Across a broad set of models and tasks, ACT yields consistent accuracy gains (up to 7.3% on MMLU and up to 13.26% on HellaSwag) and improves open-ended QA and multi-round conversations, offering a practical, model-agnostic way to enhance LLM performance without weight finetuning. The findings provide both granular insights into attention behavior and a scalable technique to harness it, potentially complementing prompting and in-context learning.

Abstract

Attention is a fundamental component behind the remarkable achievements of large language models (LLMs). However, our current understanding of the attention mechanism, especially regarding how attention distributions are established, remains limited. Inspired by recent studies that explore the presence of an attention sink at the initial token, which receives disproportionately large attention scores despite its lack of semantic importance, this work delves deeper into this phenomenon. We aim to provide a more profound understanding of the existence of attention sinks within LLMs and to uncover ways to enhance the achievable accuracy of LLMs by directly optimizing the attention distributions, without the need for weight finetuning. Specifically, this work begins with comprehensive visualizations of the attention distributions in LLMs during inference across various inputs and tasks. Based on these visualizations, to the best of our knowledge, we are the first to discover that (1) attention sinks occur not only at the start of sequences but also within later tokens of the input, and (2) not all attention sinks have a positive impact on the achievable accuracy of LLMs. Building upon our findings, we propose a training-free Attention Calibration Technique (ACT) that automatically optimizes the attention distributions on the fly during inference in an input-adaptive manner. Extensive experiments validate that ACT consistently enhances the accuracy of various LLMs across different applications. Specifically, ACT achieves an average improvement of up to 7.30% in accuracy across different datasets when applied to Llama-30B. Our code is available at https://github.com/GATECH-EIC/ACT.
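To make the idea of calibrating attention distributions concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the function names `find_sinks` and `calibrate`, the `alpha / n` threshold, the scaling factor `beta`, and the choice to leave position 0 untouched are all assumptions made for illustration. The sketch flags non-initial key positions that receive unusually high average attention, scales their scores down, and hands the removed mass to the remaining keys so that each row still sums to 1.

```python
def find_sinks(attn, alpha=2.0):
    """Return key positions (excluding position 0) whose mean received
    attention exceeds alpha / n. The threshold form is an assumption."""
    n = len(attn[0])
    col_mean = [sum(row[j] for row in attn) / len(attn) for j in range(n)]
    return [j for j in range(1, n) if col_mean[j] > alpha / n]


def calibrate(attn, sinks, beta=0.4):
    """Scale sink columns by beta and give the removed mass to the
    non-sink keys of the same row, in proportion to their scores."""
    sink_set = set(sinks)
    out = []
    for row in attn:
        sink_mass = sum(row[j] for j in sink_set)
        rest_mass = sum(row) - sink_mass
        if sink_mass == 0.0 or rest_mass == 0.0:
            out.append(list(row))  # nothing to move, or nowhere to move it
            continue
        removed = (1.0 - beta) * sink_mass
        gain = 1.0 + removed / rest_mass  # rescales non-sink scores
        out.append([v * beta if j in sink_set else v * gain
                    for j, v in enumerate(row)])
    return out


# Toy causal attention map for one head (rows: queries, columns: keys).
attn = [
    [1.0, 0.0, 0.0, 0.0],
    [0.3, 0.7, 0.0, 0.0],
    [0.2, 0.7, 0.1, 0.0],
    [0.1, 0.7, 0.1, 0.1],
]
sinks = find_sinks(attn)            # key position 1 draws most attention
calibrated = calibrate(attn, sinks)
```

In this toy map, position 1 absorbs most of the attention from every later query, so it is flagged and its scores are reduced, while the other keys in each row gain the freed mass. Averaging over all rows ignores the fact that, under causal masking, later keys can be attended by fewer queries; a more careful version would normalize per column.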

Paper Structure (19 sections, 2 equations, 8 figures, 11 tables)

Figures (8)

  • Figure 1: Upper: Visualization of the averaged attention maps across all heads and layers of Llama2-7B-chat on different datasets. Lower: Visualization of the averaged attention maps across all heads in each layer when processing a sample from SST2 with Llama2-7B-chat. Identified attention sinks in the averaged attention map from SST2 are bounded with green boxes.
  • Figure 2: Attention score distribution of the initial token (i.e., the attention sink observed in StreamingLLM; Xiao et al., 2023), non-initial high-attention tokens, and other tokens for classification tasks (top) and multiple-choice tasks (bottom).
  • Figure 3: Visualization of accuracy improvement on the MMLU dataset (Hendrycks et al., 2020) achieved by reducing the attention score of attention sinks in the middle of input sequences for each individual head separately.
  • Figure 4: Visualization of the model's averaged attention map before (left) and after (right) applying our proposed ACT.
  • Figure 5: Histogram of the positions of attention sinks throughout all 17 datasets used in our paper.
  • ...and 3 more figures