The silence of the weights: an investigation of structural pruning strategies for attention-based audio signal architectures

Andrea Diecidue, Carlo Alberto Barbano, Piero Fraternali, Mathieu Fontaine, Enzo Tartaglione

TL;DR

The paper addresses the high parameter and compute cost of attention-based audio transformers by introducing a structured pruning framework that decouples the pruning of the Q, K, V, and O projections and explores head- and channel-level strategies. It compares magnitude and Fisher information as ranking criteria, and evaluates global versus local thresholding on the Audio Spectrogram Transformer (AST) on AudioSet and SpeechCommands, using iterative pruning with LoRA fine-tuning. The key finding is that up to 50% of attention parameters can be pruned with less than 1% absolute performance loss, with Fisher information-based ranking and global thresholds providing the strongest trade-offs. The approach enables substantial efficiency gains for audio tasks and is potentially applicable to other domains such as computer vision and natural language processing.

Abstract

Transformer-based models have become the state of the art across multiple domains, from natural language processing to machine listening, thanks to attention mechanisms. However, the attention layers require a large number of parameters and high-end hardware for both training and inference. We propose a novel pruning technique targeted explicitly at the attention mechanism, where we decouple the pruning of the four layers in the attention block, namely the query, key, value, and output projection matrices. We also investigate strategies for pruning along the head and channel dimensions, and compare the performance of the Audio Spectrogram Transformer (AST) model under different pruning scenarios. Our results show that even when pruning 50% of the attention parameters, we incur a performance degradation of less than 1%.
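To make the distinction between local and global thresholds concrete, the following is a minimal sketch, not the authors' implementation. It assumes that each of the four projections is scored independently per head using the L2 norm of that head's weights (a magnitude criterion), and that heads occupy equal, contiguous groups of rows. That layout is a simplification: for the output projection the head dimension would normally lie along the input axis. The function names `head_scores`, `local_masks`, and `global_masks`, and the toy weights, are hypothetical.

```python
import math


def head_scores(weight, num_heads):
    """Magnitude score per head: L2 norm of that head's slice of the matrix.

    Assumption: `weight` is a list of rows and the heads partition the rows
    into equal contiguous groups.
    """
    rows_per_head = len(weight) // num_heads
    scores = []
    for h in range(num_heads):
        block = weight[h * rows_per_head:(h + 1) * rows_per_head]
        scores.append(math.sqrt(sum(w * w for row in block for w in row)))
    return scores


def local_masks(scores_by_proj, ratio):
    """Local threshold: prune the lowest `ratio` of heads within each projection."""
    masks = {}
    for name, scores in scores_by_proj.items():
        n_prune = int(len(scores) * ratio)
        order = sorted(range(len(scores)), key=lambda i: scores[i])
        pruned = set(order[:n_prune])
        masks[name] = [i not in pruned for i in range(len(scores))]
    return masks


def global_masks(scores_by_proj, ratio):
    """Global threshold: prune the lowest `ratio` of heads across all projections."""
    flat = [(score, name, i)
            for name, scores in scores_by_proj.items()
            for i, score in enumerate(scores)]
    flat.sort(key=lambda item: item[0])
    n_prune = int(len(flat) * ratio)
    pruned = {(name, i) for _, name, i in flat[:n_prune]}
    return {name: [(name, i) not in pruned for i in range(len(scores))]
            for name, scores in scores_by_proj.items()}


def toy_projection(scale, num_heads=4):
    """Hypothetical weights: one row per head, with norm growing by head index."""
    return [[scale * (h + 1), 0.0] for h in range(num_heads)]


# Toy setup: Q and K have small weights, V and O have large ones.
weights = {"q": toy_projection(1.0), "k": toy_projection(1.0),
           "v": toy_projection(10.0), "o": toy_projection(10.0)}
scores = {name: head_scores(w, num_heads=4) for name, w in weights.items()}

local = local_masks(scores, ratio=0.5)   # True means the head is kept
glob = global_masks(scores, ratio=0.5)
```

In this toy setup both strategies keep 8 of the 16 head slices, but they distribute the sparsity differently. The local threshold removes the two weakest heads from every projection, while the global threshold removes all of Q and K and leaves V and O untouched, because every Q and K score falls below every V and O score. A Fisher-based criterion would only change how `scores` is computed; the thresholding logic would stay the same.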

Paper Structure

This paper contains 7 sections, 1 equation, 4 figures, 2 tables, 1 algorithm.

Figures (4)

  • Figure 1: Scheme of the self-attention block.
  • Figure 2: Schemes of Same Channel, Per Head and Entire Head pruning patterns, in order.
  • Figure 3: Sparsity of the attention blocks in each layer at 20% pruning, using the Entire Head pruning scheme with the Global threshold approach on the SpeechCommands dataset. Layers are ordered from the input to the output of the network.
  • Figure 4: Inference speed in seconds on the SpeechCommands dataset at different pruning levels for the Entire Head pruning scheme.