Data-Informed Global Sparseness in Attention Mechanisms for Deep Neural Networks

Ileana Rugina, Rumen Dangovski, Li Jing, Preslav Nakov, Marin Soljačić

TL;DR

This work tackles the high computational cost of attention in Transformer models by introducing Attention Pruning (AP), a data-informed method that derives a global sparsity mask from fixed training data to prune attention patterns without changing model parameters. AP computes per-layer, per-type attention averages, applies a percentile-based threshold to create a masking matrix, and retrains with these masks, enabling substantial reductions in attention computations across language modeling, machine translation, and GLUE benchmarks. The method reveals important distinctions between self-attention and cross-attention in terms of robustness to pruning, and demonstrates practical hardware benefits via block-sparse GPU kernels, with memory savings in experiments involving SQuAD, BERT, and Llama2-7B. Overall, AP offers a simple, application-agnostic mechanism to reduce latency and memory requirements while preserving performance, and it highlights the potential for co-design with hardware to maximize efficiency in future NLP systems.

Abstract

Attention mechanisms play a crucial role in the neural revolution of Natural Language Processing (NLP). With the growth of attention-based models, several pruning techniques have been developed to identify and exploit sparseness, making these models more efficient. Most efforts focus on hard-coding attention patterns or pruning attention weights based on training data. We propose Attention Pruning (AP), a framework that observes attention patterns in a fixed dataset and generates a global sparseness mask. AP saves 90% of attention computation for language modeling and about 50% for machine translation and GLUE tasks, maintaining result quality. Our method reveals important distinctions between self- and cross-attention patterns, guiding future NLP research. Our framework can reduce both latency and memory requirements for any attention-based model, aiding in the development of improved models for existing or new NLP applications. We have demonstrated this with encoder and autoregressive transformer models using Triton GPU kernels and make our code publicly available at https://github.com/irugina/AP.
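
The mask construction summarized above (average the observed attention patterns, threshold at a percentile, then mask) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names `attention_pruning_mask` and `masked_softmax`, the array shapes, and the choice to apply the mask by setting pruned logits to negative infinity before the softmax are all assumptions made for illustration, and the input attention maps are random placeholders rather than real model outputs.

```python
import numpy as np


def attention_pruning_mask(attn_maps, p):
    """Build a binary keep-mask from observed attention maps (hypothetical helper).

    attn_maps: array of shape (num_examples, seq_len, seq_len) with attention
               probabilities collected for one layer and one attention type.
    p:         percentage (0 to 100) of entries to prune.
    """
    avg = attn_maps.mean(axis=0)        # average pattern over the fixed dataset
    threshold = np.percentile(avg, p)   # p-th percentile of the averaged values
    return (avg >= threshold).astype(np.float32)  # 1 = keep, 0 = prune


def masked_softmax(scores, mask):
    """Softmax over the last axis restricted to kept positions (assumed usage)."""
    neg = np.where(mask > 0, scores, -np.inf)
    row_max = neg.max(axis=-1, keepdims=True)
    # Rows with no kept entry have max = -inf; use 0 there to avoid NaNs.
    row_max = np.where(np.isfinite(row_max), row_max, 0.0)
    weights = np.exp(neg - row_max)     # pruned positions become exactly 0
    denom = weights.sum(axis=-1, keepdims=True)
    return np.divide(weights, denom, out=np.zeros_like(weights), where=denom > 0)


# Toy usage with random placeholder data (not real attention statistics).
rng = np.random.default_rng(0)
logits = rng.normal(size=(32, 8, 8))
attn = np.exp(logits)
attn /= attn.sum(axis=-1, keepdims=True)

mask = attention_pruning_mask(attn, p=50)          # prune about half the entries
pruned = masked_softmax(rng.normal(size=(8, 8)), mask)
```

Because the threshold in this sketch is taken over the whole averaged matrix, an entire row can end up pruned; `masked_softmax` returns zeros for such rows rather than dividing by zero. Real speedups would additionally require a sparse kernel that skips the masked entries, which this dense sketch does not attempt.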

Paper Structure

This paper contains 24 sections, 6 equations, 4 figures, 11 tables.

Figures (4)

  • Figure 1: Attention pruning maintains performance and reduces attention computations. Pruning can often enable more efficient and interpretable models with only a modest decrease in performance.
  • Figure 2: Transformer-XL pruning masks (binary valued) averaged over all layers and attention heads for $p \in \{30\%, 90\%\}$. AP prunes entries in the left half (attention to past sequences) more aggressively than the conventional self-attention entries in the right half. Note that the right half also has an auto-regressive mask.
  • Figure 3: IWSLT14 de-en train dataset attention patterns: (a) cross-attention with variable context window, (b) encoder self-attention with 1.5-entmax activation for sharper patterns, and (c) encoder self-attention with constant context window.
  • Figure 4: Relative accuracy when a GLUE task (indicated in the rows: STS-B, CoLA, MRPC, RTE) is trained with AP ($p=40$) using the attention patterns of GLUE tasks (indicated in the columns: all GLUE tasks). The relative accuracy is computed so that the in-domain experiment is zero, and the out-of-domain experiments show deviations in accuracy.
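
The normalization described in the Figure 4 caption can be sketched as follows. The caption does not say whether the deviation is a difference or a ratio, so this sketch assumes a plain difference from the in-domain accuracy. The helper name `relative_accuracy` and the accuracy values are hypothetical placeholders, not results from the paper.

```python
import numpy as np


def relative_accuracy(acc_row, in_domain_index):
    """Shift a row of accuracies so the in-domain entry is zero (assumed definition)."""
    acc_row = np.asarray(acc_row, dtype=float)
    return acc_row - acc_row[in_domain_index]


# Placeholder accuracies for one target task trained with masks taken from
# three source tasks; index 0 is the in-domain (same-task) mask.
toy_accuracies = [0.80, 0.78, 0.82]
deviations = relative_accuracy(toy_accuracies, in_domain_index=0)
```

Under this assumed definition, a negative entry means the out-of-domain mask hurt accuracy relative to the in-domain mask, and a positive entry means it helped.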