EchoAtt: Attend, Copy, then Adjust for More Efficient Large Language Models

Hossein Rajabzadeh, Aref Jafari, Aman Sharma, Benyamin Jami, Hyock Ju Kwon, Ali Ghodsi, Boxing Chen, Mehdi Rezagholizadeh

TL;DR

As Large Language Models grow, their inference and fine-tuning costs rise. EchoAtt observes that attention patterns in inner layers are highly similar and shares attention matrices within blocks of those layers, using a two-stage knowledge distillation process to recover performance. On TinyLLaMA-1.1B, the best configuration improves inference speed by 15% and training speed by 25% with approximately 4% fewer parameters, without degrading zero-shot performance. The work shows a practical path to more efficient LLMs for real-time and resource-constrained deployments while preserving generalization.

Abstract

Large Language Models (LLMs), with their increasing depth and number of parameters, have demonstrated outstanding performance across a variety of natural language processing tasks. However, this growth in scale leads to increased computational demands, particularly during inference and fine-tuning. To address these challenges, we introduce EchoAtt, a novel framework aimed at optimizing transformer-based models by analyzing and leveraging the similarity of attention patterns across layers. Our analysis reveals that many inner layers in LLMs, especially larger ones, exhibit highly similar attention matrices. By exploiting this similarity, EchoAtt enables the sharing of attention matrices in less critical layers, significantly reducing computational requirements without compromising performance. We incorporate this approach within a knowledge distillation setup, where a pre-trained teacher model guides the training of a smaller student model. The student model selectively shares attention matrices in layers with high similarity while inheriting key parameters from the teacher. Our best results with TinyLLaMA-1.1B demonstrate that EchoAtt improves inference speed by 15%, training speed by 25%, and reduces the number of parameters by approximately 4%, all while improving zero-shot performance. These findings highlight the potential of attention matrix sharing to enhance the efficiency of LLMs, making them more practical for real-time and resource-limited applications.
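
The idea of sharing one attention matrix across several layers can be illustrated with a minimal single-head sketch in plain Python. This is not the authors' implementation: the function names, the absence of masking, normalization, and feed-forward sublayers, and the choice that only the first layer of a block owns the query and key projections while every layer keeps its own value projection are all assumptions made for illustration.

```python
import math

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax_rows(m):
    """Row-wise softmax, stabilised by subtracting each row's maximum."""
    out = []
    for row in m:
        mx = max(row)
        exps = [math.exp(v - mx) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

def attention_matrix(x, w_q, w_k):
    """softmax(Q K^T / sqrt(d)) for a single head, without masking."""
    q, k = matmul(x, w_q), matmul(x, w_k)
    d = len(q[0])
    k_t = [list(col) for col in zip(*k)]
    scores = [[v / math.sqrt(d) for v in row] for row in matmul(q, k_t)]
    return softmax_rows(scores)

def shared_attention_block(x, w_q, w_k, value_weights):
    """Compute the attention matrix once and reuse it in every layer of the block.

    Assumed (hypothetical) layout: one (w_q, w_k) pair for the whole block,
    one value projection per layer, and a residual connection after each layer.
    """
    attn = attention_matrix(x, w_q, w_k)  # computed a single time
    h = x
    for w_v in value_weights:
        update = matmul(attn, matmul(h, w_v))  # reuse attn, layer-specific values
        h = [[a + b for a, b in zip(r1, r2)] for r1, r2 in zip(h, update)]
    return h, attn
```

Under this assumed layout, a block of n layers evaluates the softmax over query-key scores once instead of n times and stores one query/key projection pair instead of n, which is the kind of saving the abstract attributes to attention sharing.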

Paper Structure (15 sections, 3 equations, 3 figures, 6 tables)


Figures (3)

  • Figure 1: Average cosine similarity between one layer's attention and the attention of every other layer. The results show that the attention scores of some layers are more similar to one another than those of other layers.
  • Figure 2: Average cosine similarities between the attention matrices of different layers in various LLMs, visualized as upper-triangle matrices. Each entry $[i, j]$ represents the similarity between the attention scores of layer $i$ and layer $j$, with higher values indicating more similar attention mechanisms. The results highlight attention similarities in inner layers, suggesting potential for sharing attention mechanisms to reduce computational complexity.
  • Figure 3: (a) A standard transformer block, which consists of a single transformer layer. (b) A shared attention block, where multiple transformer layers utilize a single attention mechanism. (c) The architecture of the student and teacher models used in the proposed distillation method.
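
The layer-to-layer comparison shown in Figures 1 and 2 can be sketched as follows. This is a rough illustration, not the paper's measurement code: it assumes each layer's attention matrix is flattened to a vector before the cosine is taken, handles a single input and head, and omits the averaging over inputs that the captions mention. The function names are hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two attention matrices, flattened to vectors."""
    u = [v for row in a for v in row]
    w = [v for row in b for v in row]
    dot = sum(p * q for p, q in zip(u, w))
    norm = math.sqrt(sum(p * p for p in u)) * math.sqrt(sum(q * q for q in w))
    return dot / norm

def layer_similarity(attn_per_layer):
    """Upper-triangle similarity matrix over layers.

    Entry [i][j] with j >= i holds the similarity between layers i and j;
    entries below the diagonal are left as None.
    """
    n = len(attn_per_layer)
    sim = [[None] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            sim[i][j] = cosine_similarity(attn_per_layer[i], attn_per_layer[j])
    return sim
```

Runs of adjacent layers with high entries in such a matrix are natural candidates for grouping into a shared attention block like the one in Figure 3(b).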