Head-wise Shareable Attention for Large Language Models

Zouying Cao; Yifei Yang; Hai Zhao

TL;DR

This paper presents a perspective on head-wise shareable attention for large language models and proposes two memory-efficient methods that share parameters across attention heads, with a specific focus on LLMs.

Abstract

Large Language Models (LLMs) suffer from a huge number of parameters, which restricts their deployment on edge devices. Weight sharing is one promising solution that encourages weight reuse, effectively reducing memory usage with a smaller performance drop. However, current weight sharing techniques primarily focus on small-scale models like BERT and employ coarse-grained sharing rules, e.g., layer-wise. This becomes limiting given the prevalence of LLMs, and sharing an entire layer or block diminishes the flexibility of weight sharing. In this paper, we present a perspective on head-wise shareable attention for large language models. We further propose two memory-efficient methods that share parameters across attention heads, with a specific focus on LLMs. Both of them use the same dynamic strategy to select the shared weight matrices. The first method, denoted $\textbf{DirectShare}$, directly reuses the pre-trained weights without retraining. The second method, denoted $\textbf{PostShare}$, first post-trains with a constraint on weight matrix similarity and then shares. Experimental results show that our head-wise shared models still maintain satisfactory capabilities, demonstrating the feasibility of fine-grained weight sharing applied to LLMs.
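
To make the idea of head-wise sharing concrete, the following is a minimal sketch of similarity-based weight reuse in the spirit of DirectShare. It is not the paper's implementation: the function names (`cosine_similarity`, `share_heads`), the `share_ratio` parameter, and the greedy pair selection are all assumptions made for illustration, and each head is represented as a flattened list of floats rather than a real weight matrix.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two flattened weight vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def share_heads(head_weights, share_ratio):
    """Pick heads whose weights are replaced by those of a similar head.

    head_weights: list of flattened per-head weight vectors.
    share_ratio: fraction of heads to replace (hypothetical knob).
    Returns a dict {replaced_head_index: donor_head_index}.

    The greedy rule below (most similar pairs first) is an assumption,
    not the selection strategy described in the paper.
    """
    n = len(head_weights)
    # Score every unordered pair of heads, most similar first.
    pairs = sorted(
        (
            (cosine_similarity(head_weights[i], head_weights[j]), i, j)
            for i in range(n)
            for j in range(i + 1, n)
        ),
        reverse=True,
    )
    budget = int(share_ratio * n)
    mapping = {}
    for _, i, j in pairs:
        if len(mapping) >= budget:
            break
        donors = set(mapping.values())
        # Avoid chains: a replaced head cannot donate, a donor cannot be replaced.
        if i in mapping or j in mapping or j in donors:
            continue
        mapping[j] = i
    return mapping


def apply_sharing(head_weights, mapping):
    """Return per-head weights where replaced heads point at their donor's weights."""
    return [head_weights[mapping.get(k, k)] for k in range(len(head_weights))]
```

After `apply_sharing`, replaced heads reference the same list object as their donor, which mirrors how shared parameters would be stored once instead of twice. A PostShare-style variant would first post-train with a similarity constraint before this step; that training loop is outside the scope of this sketch.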

Paper Structure (40 sections, 6 equations, 8 figures, 22 tables)

Introduction
Related Works
Memory-efficient Approaches for LLMs
Weight Sharing
Motivation and Empirical Analysis
Attention Map Similarity: From Layer-wise to Head-wise
From Attention Map Similarity to Weight Matrix Similarity
Head-wise Shareable Attention
Head-wise Weight Sharing Strategy
DirectShare
PostShare
Experiments
Experimental Settings
Main Results
Evaluation on DirectShare
...and 25 more sections

Figures (8)

Figure 1: (a) Layer-wise Attention Map Similarity. Taking the last layer as an example, the attention layer most similar to it is marked with $\surd$. (b) Head-wise Attention Map Similarity. $\surd$ marks the top n heads whose attention maps are most similar to that of the 6-th head in the last layer (n = the number of heads per layer). (c) Weight Matrix Similarity. $\bigcirc$ marks the connection between attention map similarity and weight similarity.
Figure 2: ① DirectShare: Inspired by attention map reuse, directly share weight matrices across different heads based on cosine similarity; ② PostShare: To balance the memory usage and the performance, implement post-training with the constraint of weight matrix similarity and then share.
Figure 3: Experiments performed on PIQA and OpenBookQA using different head-wise match functions for the Baichuan2-7B model.
Figure 4: DirectShare using Head-wise Weight Sharing Strategy
Figure 5: Performance of DirectShare across different subjects based on Llama2-7B on C-Eval and MMLU.
...and 3 more figures
