Attention Bottlenecks for Multimodal Fusion

Arsha Nagrani; Shan Yang; Anurag Arnab; Aren Jansen; Cordelia Schmid; Chen Sun

TL;DR

The paper introduces the Multimodal Bottleneck Transformer (MBT), a transformer-based architecture for audiovisual fusion that restricts cross-modal information exchange to a small set of fusion bottleneck latents. It compares vanilla, modality-specific, and bottleneck fusion strategies, and finds that bottleneck fusion introduced at the middle layers (mid-fusion) gives the best trade-off between accuracy and compute. MBT achieves state-of-the-art results on AudioSet, Epic-Kitchens, and VGGSound, including a 5.9 mAP improvement on mini AudioSet, and attention visualizations provide further insight into the learned fusion. Routing cross-modal exchange through the bottlenecks avoids the quadratic cost of full pairwise attention across both modalities while still integrating audio and video effectively across diverse video datasets.
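The compute argument can be illustrated with a rough count of query-key pairs per attention layer. This is a back-of-the-envelope sketch, not a figure from the paper: the token counts (400 per modality, 4 bottleneck latents) and the function names are hypothetical, and the count ignores projections, heads, and MLP cost.

```python
def attention_pairs_full(n_video: int, n_audio: int) -> int:
    """Query-key pairs when all tokens of both modalities attend to each other."""
    n = n_video + n_audio
    return n * n


def attention_pairs_bottleneck(n_video: int, n_audio: int, n_bottleneck: int) -> int:
    """Query-key pairs when each modality attends only to itself plus the bottlenecks."""
    video_side = (n_video + n_bottleneck) ** 2
    audio_side = (n_audio + n_bottleneck) ** 2
    return video_side + audio_side


# Hypothetical token counts, chosen only for illustration.
full = attention_pairs_full(400, 400)           # 640000
bott = attention_pairs_bottleneck(400, 400, 4)  # 326432
print(full, bott, round(bott / full, 3))
```

Under these assumed sizes the bottleneck layer scores roughly half as many query-key pairs as full pairwise attention, only slightly more than two fully separate unimodal layers (2 x 400^2 = 320000).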

Abstract

Humans perceive the world by concurrently processing and fusing high-dimensional inputs from multiple modalities such as vision and audio. Machine perception models, in stark contrast, are typically modality-specific and optimised for unimodal benchmarks, and hence late-stage fusion of final representations or predictions from each modality ('late-fusion') is still a dominant paradigm for multimodal video classification. Instead, we introduce a novel transformer-based architecture that uses 'fusion bottlenecks' for modality fusion at multiple layers. Compared to traditional pairwise self-attention, our model forces information between different modalities to pass through a small number of bottleneck latents, requiring the model to collate and condense the most relevant information in each modality and only share what is necessary. We find that such a strategy improves fusion performance while also reducing computational cost. We conduct thorough ablation studies, and achieve state-of-the-art results on multiple audio-visual classification benchmarks including AudioSet, Epic-Kitchens and VGGSound. All code and models will be released.
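The bottleneck idea described in the abstract can be sketched as a single fusion layer in NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: attention is single-head with identity projections (no learned weights, residuals, or MLP), the function names and tensor sizes are hypothetical, and averaging the two updated copies of the bottleneck is an assumption of this sketch.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def self_attention(tokens):
    # Single-head scaled dot-product attention with identity projections
    # (illustration only; a real layer has learned Q/K/V weights).
    d = tokens.shape[-1]
    scores = tokens @ tokens.T / np.sqrt(d)
    return softmax(scores) @ tokens


def bottleneck_fusion_layer(video, audio, bottleneck):
    """One fusion step: each modality attends to itself plus the shared bottlenecks."""
    n_v, n_a = len(video), len(audio)
    # Video tokens never see audio tokens directly, only the bottleneck latents.
    v_out = self_attention(np.concatenate([video, bottleneck]))
    a_out = self_attention(np.concatenate([audio, bottleneck]))
    new_video, bn_from_video = v_out[:n_v], v_out[n_v:]
    new_audio, bn_from_audio = a_out[:n_a], a_out[n_a:]
    # Assumption: merge the two updated bottleneck copies by averaging.
    new_bottleneck = 0.5 * (bn_from_video + bn_from_audio)
    return new_video, new_audio, new_bottleneck


rng = np.random.default_rng(0)
video = rng.normal(size=(8, 16))       # 8 hypothetical video tokens
audio = rng.normal(size=(6, 16))       # 6 hypothetical audio tokens
bottleneck = rng.normal(size=(2, 16))  # 2 bottleneck latents

new_video, new_audio, new_bottleneck = bottleneck_fusion_layer(video, audio, bottleneck)
print(new_video.shape, new_audio.shape, new_bottleneck.shape)
```

Within one such layer, the video output depends only on the video tokens and the incoming bottlenecks. Audio information reaches the video stream only in the next layer, after it has been condensed into the bottleneck latents.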
