Attention Bottlenecks for Multimodal Fusion

Arsha Nagrani, Shan Yang, Anurag Arnab, Aren Jansen, Cordelia Schmid, Chen Sun

TL;DR

The paper introduces the Multimodal Bottleneck Transformer (MBT), a transformer-based architecture for audiovisual fusion that routes cross-modal information exchange through a small set of fusion bottlenecks, improving both fusion performance and efficiency. It compares vanilla, modality-specific, and bottleneck fusion strategies, with fusion restricted to later layers (mid fusion) combined with bottlenecks providing the best trade-off between accuracy and compute. MBT achieves state-of-the-art results on AudioSet, Epic-Kitchens, and VGGSound, including a 5.9 mAP improvement on mini AudioSet, and provides insights via attention visualizations. The approach reduces the quadratic cost of pairwise cross-modal attention while remaining effective across diverse video datasets.

Abstract

Humans perceive the world by concurrently processing and fusing high-dimensional inputs from multiple modalities such as vision and audio. Machine perception models, in stark contrast, are typically modality-specific and optimised for unimodal benchmarks, and hence late-stage fusion of final representations or predictions from each modality ('late-fusion') is still a dominant paradigm for multimodal video classification. Instead, we introduce a novel transformer-based architecture that uses 'fusion bottlenecks' for modality fusion at multiple layers. Compared to traditional pairwise self-attention, our model forces information between different modalities to pass through a small number of bottleneck latents, requiring the model to collate and condense the most relevant information in each modality and only share what is necessary. We find that such a strategy improves fusion performance, at the same time reducing computational cost. We conduct thorough ablation studies, and achieve state-of-the-art results on multiple audio-visual classification benchmarks including AudioSet, Epic-Kitchens and VGGSound. All code and models will be released.
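
The bottleneck idea in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: it assumes single-head attention with identity projections, no MLP or layer norm, and it assumes the per-modality bottleneck updates are combined by averaging. The function names (`self_attention`, `bottleneck_fusion_layer`) and the toy sizes are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(tokens):
    # Single-head self-attention with identity Q/K/V projections
    # (an illustrative simplification, not a full transformer layer).
    d = tokens.shape[-1]
    scores = tokens @ tokens.T / np.sqrt(d)
    return softmax(scores) @ tokens

def bottleneck_fusion_layer(video, audio, bottleneck):
    # Each modality attends only over its own tokens plus the shared
    # bottleneck tokens, so video and audio never attend to each other directly.
    n_v, n_a = len(video), len(audio)
    v_out = self_attention(np.concatenate([video, bottleneck]))
    a_out = self_attention(np.concatenate([audio, bottleneck]))
    new_video, fsn_from_video = v_out[:n_v], v_out[n_v:]
    new_audio, fsn_from_audio = a_out[:n_a], a_out[n_a:]
    # Assumption: the two bottleneck updates are merged by averaging.
    new_bottleneck = (fsn_from_video + fsn_from_audio) / 2
    return new_video, new_audio, new_bottleneck

# Toy sizes: 3 tokens per modality, 2 bottleneck tokens, width 8.
rng = np.random.default_rng(0)
video = rng.normal(size=(3, 8))
audio = rng.normal(size=(3, 8))
bottleneck = rng.normal(size=(2, 8))
new_video, new_audio, new_bottleneck = bottleneck_fusion_layer(video, audio, bottleneck)
```

Within a single layer of this sketch, the updated video tokens depend only on the video tokens and the incoming bottleneck tokens. Audio information can reach them only in a later layer, after it has been condensed into the bottleneck.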

Paper Structure

This paper contains 30 sections, 9 equations, 11 figures, 10 tables.

Figures (11)

  • Figure 1: Cross-modal Fusion. Unlike late fusion (left), where no cross-modal information is exchanged in the model until after the classifier, we investigate two pathways for the exchange of cross-modal information. The first is via standard pairwise self-attention across all hidden units in a layer, but applied only to later layers in the model -- mid fusion (middle, left). We also propose the use of 'fusion bottlenecks' (middle, right) that restrict attention flow within a layer through tight latent units. Both forms of restriction can be applied in conjunction (Bottleneck Mid Fusion) for optimal performance (right). We show $B=2$ bottleneck units and 3 hidden units per modality. Grey boxes indicate tokens that receive attention flow from both audio and video tokens.
  • Figure 2: A Multimodal Fusion Transformer applied to audiovisual inputs. The input sequence consists of image and spectrogram patches. These are then projected into tokens and appended to special CLS (classification) and FSN (fusion bottleneck) tokens. Our transformer encoder then uses self-attention to model unimodal information, and restricts cross-modal information flow via cross-attention with the bottleneck tokens at multiple layers of the network.
  • Figure 3: The impact of using attention bottlenecks for fusion on performance (left) and compute (right) at different fusion layers $L_f$ on AudioSet, using clip span $t=4$ and $B=4$ bottleneck tokens. Attention bottlenecks improve performance at lower computational cost.
  • Figure 4: The effect of varying input clip span $t$ on the AudioSet test set.
  • Figure 5: The effect of training data size on the AudioSet test set.
  • ...and 6 more figures
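
The compute comparison described for Figure 3 can be made concrete with a rough count of query-key pairs per fusion layer. This is a back-of-the-envelope sketch, not the paper's FLOP measurement: it ignores projections, MLPs and attention heads, and the token counts used below (1000 video tokens, 400 audio tokens) are hypothetical.

```python
def attention_pairs_vanilla(n_video, n_audio):
    # Full pairwise self-attention over the concatenated sequence.
    return (n_video + n_audio) ** 2

def attention_pairs_bottleneck(n_video, n_audio, n_bottleneck):
    # Each modality attends over its own tokens plus the bottleneck tokens.
    return (n_video + n_bottleneck) ** 2 + (n_audio + n_bottleneck) ** 2

# Hypothetical sequence lengths with B = 4 bottleneck tokens.
vanilla = attention_pairs_vanilla(1000, 400)
bottleneck = attention_pairs_bottleneck(1000, 400, 4)
print(vanilla, bottleneck)  # prints "1960000 1171232"

# The toy setting drawn in Figure 1 (3 tokens per modality, B = 2).
toy_vanilla = attention_pairs_vanilla(3, 3)
toy_bottleneck = attention_pairs_bottleneck(3, 3, 2)
print(toy_vanilla, toy_bottleneck)  # prints "36 50"
```

Under this count the saving comes from dropping the video-to-audio cross terms, so it only appears when the sequences are long relative to the number of bottleneck tokens. In the tiny Figure 1 illustration the bottleneck variant actually has more pairs.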