Density Adaptive Attention is All You Need: Robust Parameter-Efficient Fine-Tuning Across Multiple Modalities
Georgios Ioannides, Aman Chadha, Aaron Elkins
TL;DR
To make fine-tuning of large pre-trained models more robust and efficient across speech, text, and vision, the paper introduces a density-based attention mechanism. Multi-Head DAAM and the Density Adaptive Transformer embed a learnable mean offset and variance scale for each Gaussian, enabling dynamic recalibration of feature importance. The approach improves on standard dot-product attention and PEFT baselines, particularly on highly non-stationary data, with accuracy gains of up to approximately +20% (absolute), and it applies across modalities. The paper also introduces the Importance Factor for interpretability and reports strong parameter efficiency relative to LoRA on models such as WavLM, Llama2, and BEiT, with code released.
Abstract
We propose the Multi-Head Density Adaptive Attention Mechanism (DAAM), a novel probabilistic attention framework that can be used for Parameter-Efficient Fine-tuning (PEFT), and the Density Adaptive Transformer (DAT), designed to enhance information aggregation across multiple modalities, including Speech, Text, and Vision. DAAM integrates learnable mean and variance parameters into its attention mechanism and is implemented in a multi-head framework, enabling the heads to collectively model any probability distribution for dynamic recalibration of feature significance. This method demonstrates significant improvements, especially on highly non-stationary data, surpassing state-of-the-art attention techniques by up to approximately +20% (abs.) in accuracy. Empirically, DAAM exhibits superior adaptability and efficacy across a diverse range of tasks, including emotion recognition in speech, image classification, and text classification, thereby establishing its robustness and versatility in handling data across multiple modalities. Furthermore, we introduce the Importance Factor, a new learning-based metric that enhances the explainability of models trained with DAAM-based methods.
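To make the idea of attention with a learnable mean and variance concrete, the following is a minimal sketch in plain Python, not the released implementation. It assumes that each head standardizes its slice of the features, shifts the mean by a learnable offset, and gates every feature by a product of Gaussian densities with learnable scales. The function names, the parameter names (`mean_offsets`, `scales`), and the equal-chunk head split are illustrative assumptions; in practice the offsets and scales would be trained by gradient descent rather than fixed by hand.

```python
import math


def density_attention(x, mean_offsets, scales, eps=1e-8):
    """Gate each feature of x by a product of Gaussian densities.

    mean_offsets and scales hold one (assumed learnable) value per Gaussian.
    """
    n = len(x)
    mu = sum(x) / n
    var = sum((v - mu) ** 2 for v in x) / n
    std = math.sqrt(var + eps)  # eps avoids division by zero

    out = []
    for v in x:
        log_w = 0.0
        for delta, xi in zip(mean_offsets, scales):
            # Distance from the shifted mean, in units of the data's std.
            y = (v - (mu + delta)) / std
            # Log-density of N(0, xi^2) evaluated at y.
            log_w += -0.5 * (y / xi) ** 2 - math.log(xi * math.sqrt(2 * math.pi))
        out.append(v * math.exp(log_w))
    return out


def multi_head_density_attention(x, head_params):
    """Split x into equal chunks and gate each chunk with its own parameters.

    head_params is a list of (mean_offsets, scales) pairs, one per head.
    """
    num_heads = len(head_params)
    if len(x) % num_heads != 0:
        raise ValueError("feature length must be divisible by the number of heads")
    size = len(x) // num_heads

    out = []
    for h, (offsets, scales) in enumerate(head_params):
        chunk = x[h * size:(h + 1) * size]
        out.extend(density_attention(chunk, offsets, scales))
    return out


if __name__ == "__main__":
    features = [1.0, 2.0, 3.0, 10.0, 20.0, 30.0]
    # Two heads, each with a single Gaussian; the values are hand-picked.
    params = [([0.0], [1.0]), ([0.5], [2.0])]
    print(multi_head_density_attention(features, params))
```

In this sketch a feature that lies at the shifted mean receives the largest weight, and features further away are attenuated at a rate set by the scale. Because each head carries only one offset and one scale per Gaussian, the number of added parameters stays very small, which is the property that makes this kind of gating attractive for PEFT.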
