MA-AVT: Modality Alignment for Parameter-Efficient Audio-Visual Transformers

Tanvir Mahmud, Shentong Mo, Yapeng Tian, Diana Marculescu

TL;DR

MA-AVT targets parameter-efficient audio-visual learning by aligning the two modalities through a frozen ViT backbone augmented with learnable unimodal and shared tokens. It introduces blockwise semantic contrastive learning to supervise hierarchical cross-modal features, and a robust foreground mining mechanism to suppress background noise, which together give deeper modality alignment. The approach improves on state-of-the-art methods on AVE, VGGSound, and CREMA-D while keeping the number of trainable parameters low.

Abstract

Recent advances in pre-trained vision transformers have shown promise in parameter-efficient audio-visual learning without audio pre-training. However, few studies have investigated effective methods for aligning multimodal features in parameter-efficient audio-visual transformers. In this paper, we propose MA-AVT, a new parameter-efficient audio-visual transformer employing deep modality alignment for corresponding multimodal semantic features. Specifically, we introduce joint unimodal and multimodal token learning for aligning the two modalities with a frozen modality-shared transformer. This allows the model to learn separate representations for each modality, while also attending to the cross-modal relationships between them. In addition, unlike prior work that only aligns coarse features from the output of unimodal encoders, we introduce blockwise contrastive learning to align coarse-to-fine-grained hierarchical features throughout the encoding phase. Furthermore, to separate the background features in each modality from the foreground-matched audio-visual features, we introduce a robust discriminative foreground mining scheme. Through extensive experiments on the AVE, VGGSound, and CREMA-D benchmark datasets, we achieve considerable performance improvements over state-of-the-art methods.
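
The blockwise contrastive idea described in the abstract can be illustrated with a minimal sketch. It assumes a symmetric InfoNCE-style loss computed on per-sample pooled embeddings from the audio and visual streams after each transformer block, averaged over blocks. The abstract does not give the exact loss form, so the function names, the temperature of 0.07, and the averaging are all assumptions made for illustration, not the authors' implementation.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def _normalize(v: Vector) -> List[float]:
    """Scale a vector to unit L2 norm (zero vectors are left unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def _cross_entropy_diag(logits: List[List[float]]) -> float:
    """Mean cross-entropy where the correct column for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for a numerically stable log-sum-exp
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_z - row[i]
    return total / len(logits)


def contrastive_loss(audio: Sequence[Vector], visual: Sequence[Vector],
                     temperature: float = 0.07) -> float:
    """Symmetric InfoNCE-style loss for one block (assumed form).

    audio[i] and visual[i] are embeddings of the same sample, so the
    matching pairs lie on the diagonal of the similarity matrix.
    """
    a = [_normalize(v) for v in audio]
    b = [_normalize(v) for v in visual]
    logits = [[sum(x * y for x, y in zip(ai, bj)) / temperature for bj in b]
              for ai in a]
    logits_t = [list(col) for col in zip(*logits)]  # visual-to-audio direction
    return 0.5 * (_cross_entropy_diag(logits) + _cross_entropy_diag(logits_t))


def blockwise_contrastive_loss(audio_blocks: Sequence[Sequence[Vector]],
                               visual_blocks: Sequence[Sequence[Vector]],
                               temperature: float = 0.07) -> float:
    """Average the per-block losses over all blocks (averaging is an assumption)."""
    losses = [contrastive_loss(a, v, temperature)
              for a, v in zip(audio_blocks, visual_blocks)]
    return sum(losses) / len(losses)


if __name__ == "__main__":
    emb = [[1.0, 0.0], [0.0, 1.0]]      # toy embeddings for two samples
    swapped = [[0.0, 1.0], [1.0, 0.0]]  # same embeddings with pairing broken
    print("aligned:", contrastive_loss(emb, emb))
    print("misaligned:", contrastive_loss(emb, swapped))
    print("two blocks:", blockwise_contrastive_loss([emb, emb], [emb, swapped]))
```

With the toy inputs above, the aligned pairing gives a loss close to zero and the swapped pairing gives a much larger one, which is the behaviour a contrastive alignment term is meant to have at every block.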

Paper Structure (24 sections, 5 equations, 3 figures, 5 tables)

Figures (3)

  • Figure 1: A visual image contains sounding foreground object regions, as well as silent background regions. MA-AVT aims to align the foreground visual features with the corresponding audio features. Simultaneously, MA-AVT learns mismatched unimodal features to enhance cross-modal contrast. In particular, MA-AVT leverages a pre-trained frozen vision transformer in audio-visual tasks with learnable unimodal and shared cross-modal tokens.
  • Figure 2: Overview of the proposed MA-AVT framework. The image and audio spectrogram are processed simultaneously with frozen transformer encoders. First, patch tokens are extracted using the transformers' pre-trained patch extractors. Learnable unimodal audio and visual tokens learn modality-specific representations, and shared multimodal tokens learn a joint representation. To focus on the tokens most relevant to the target class, local self-attention (LSA) modules are applied to each group of tokens. To further enhance modality alignment, blockwise semantic contrastive learning is applied to the intermediate shared multimodal token embeddings after each transformer block. To suppress mismatched background regions, learnable background (BG) and foreground (FG) class tokens are introduced. Here, $\mathcal{L}_{bf}$ denotes the foreground-background loss and $\mathcal{L}_{cnt}^k$ denotes the contrastive loss after the $k^{th}$ block.
  • Figure 3: Grad-CAM visualization for qualitative comparison. Here, red color denotes high attention values and blue color denotes low attention values. Modality alignment brings noticeable improvements in MA-AVT to put more attention on the sounding regions. In general, MA-AVT better discovers the target visual sounding regions with sharper boundaries compared to other competitive baselines. Moreover, MA-AVT significantly reduces the attention weights on the silent regions.
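
The two loss terms named in the Figure 2 caption could plausibly be combined with a task loss into a single training objective. The form below is only a sketch: the classification term $\mathcal{L}_{cls}$, the number of transformer blocks $K$, and the weights $\lambda_{cnt}$ and $\lambda_{bf}$ are assumptions introduced for illustration and are not stated in this summary.

$$
\mathcal{L}_{total} = \mathcal{L}_{cls} + \lambda_{cnt} \sum_{k=1}^{K} \mathcal{L}_{cnt}^{k} + \lambda_{bf}\, \mathcal{L}_{bf}
$$

Under this reading, the sum over $k$ is what makes the contrastive supervision blockwise: every block contributes its own alignment term instead of only the final encoder output.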