Density Adaptive Attention is All You Need: Robust Parameter-Efficient Fine-Tuning Across Multiple Modalities

Georgios Ioannides, Aman Chadha, Aaron Elkins

TL;DR

To make fine-tuning of large pre-trained models robust and efficient across speech, text, and vision, the paper introduces a density-based attention mechanism. Multi-Head DAAM and the Density Adaptive Transformer (DAT) give each Gaussian a learnable mean offset and variance scale, enabling dynamic recalibration of feature importance. The approach improves on standard dot-product attention and PEFT baselines, particularly on highly non-stationary data, with accuracy gains of up to approximately +20% (absolute), and it applies across modalities. It also introduces the Importance Factor for interpretability and demonstrates strong parameter efficiency versus LoRA on models such as WavLM, Llama2, and BEiT. The code has been released.

Abstract

We propose the Multi-Head Density Adaptive Attention Mechanism (DAAM), a novel probabilistic attention framework that can be used for Parameter-Efficient Fine-tuning (PEFT), and the Density Adaptive Transformer (DAT), designed to enhance information aggregation across multiple modalities, including Speech, Text, and Vision. DAAM integrates learnable mean and variance into its attention mechanism, implemented in a multi-head framework, enabling it to collectively model any probability distribution for dynamic recalibration of feature significance. This method demonstrates significant improvements, especially with highly non-stationary data, surpassing the state-of-the-art attention techniques in model performance, up to approximately +20% (abs.) in accuracy. Empirically, DAAM exhibits superior adaptability and efficacy across a diverse range of tasks, including emotion recognition in speech, image classification, and text classification, thereby establishing its robustness and versatility in handling data across multiple modalities. Furthermore, we introduce the Importance Factor, a new learning-based metric that enhances the explainability of models trained with DAAM-based methods.
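The idea of attention weights driven by a learnable mean and variance can be illustrated with a small sketch. This is a minimal illustration under stated assumptions, not the authors' released implementation: the function names (`gaussian_gate`, `density_adaptive_attention`, `multi_head_daam`), the normalization by the input's own mean and variance, the product used to combine several Gaussians, and the split into contiguous equal-sized heads are all hypothetical choices made for the example. In a real model the offsets and scales would be trainable parameters; here they are plain numbers.

```python
import math


def gaussian_gate(features, mean_offset, scale, eps=1e-8):
    """Gaussian density weights for one component over a 1-D feature list.

    Assumption: features are standardized with their own mean and variance,
    and `mean_offset` / `scale` stand in for the learnable parameters.
    """
    n = len(features)
    mean = sum(features) / n
    var = sum((x - mean) ** 2 for x in features) / n
    std = math.sqrt(var + eps)
    norm_const = math.sqrt(2.0 * math.pi) * scale
    weights = []
    for x in features:
        z = (x - (mean + mean_offset)) / std
        weights.append(math.exp(-(z * z) / (2.0 * scale * scale)) / norm_const)
    return weights


def density_adaptive_attention(features, mean_offsets, scales):
    """Reweight features by the product of several Gaussian gates.

    Assumption: combining components by multiplication is an illustrative
    choice, not something stated in the abstract.
    """
    gate = [1.0] * len(features)
    for delta, c in zip(mean_offsets, scales):
        component = gaussian_gate(features, delta, c)
        gate = [g * w for g, w in zip(gate, component)]
    return [x * g for x, g in zip(features, gate)]


def multi_head_daam(features, head_params):
    """Apply an independent set of Gaussians to each contiguous head.

    `head_params` is a list with one (mean_offsets, scales) pair per head.
    """
    num_heads = len(head_params)
    if num_heads == 0 or len(features) % num_heads != 0:
        raise ValueError("feature length must be divisible by the head count")
    head_size = len(features) // num_heads
    output = []
    for h, (mean_offsets, scales) in enumerate(head_params):
        chunk = features[h * head_size:(h + 1) * head_size]
        output.extend(density_adaptive_attention(chunk, mean_offsets, scales))
    return output


if __name__ == "__main__":
    x = [0.2, 1.5, -0.7, 3.1, 0.0, 0.9, 2.2, -1.4]
    # Two heads, each with two Gaussians (hypothetical parameter values).
    params = [([0.0, 0.5], [1.0, 2.0]), ([-0.3, 0.1], [0.8, 1.5])]
    print(multi_head_daam(x, params))
```

In this sketch, features close to the shifted mean of their head receive the largest weight, and each head can place its densities independently of the others. Whether that matches the paper's exact formulation should be checked against the paper's own algorithms and released code.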

Paper Structure

This paper contains 15 sections, 4 equations, 3 figures, 6 tables, 4 algorithms.

Figures (3)

  • Figure 1: Proposed model architecture showcasing a pre-trained model (i.e., the encoder) for feature extraction (i.e., embeddings) via its $N$ transformer layers, followed by the attention module within the decoder network for selective emphasis, and concluding with probability output. The process flow is marked with the trainable and frozen states.
  • Figure 2: Importance Factor (IF) values for different processing tasks with their respective models (output feature number on the X-axis, layer number on the Y-axis).
  • Figure 3: Percentage contribution of each layer to attention weights in different downstream tasks (for best performing DAAM-based models using $g : 8$).