Representation Learning for Semantic Alignment of Language, Audio, and Visual Modalities

Parthasaarathy Sudarsanam, Irene Martín-Morató, Tuomas Virtanen

TL;DR

This work tackles the challenge of trimodal alignment across audio, visual, and text. It introduces SLAVA, a single-stage contrastive learning framework that jointly optimizes $L_{vt}$, $L_{at}$, and $L_{av}$ using the AVCaps dataset, enabling unified representations for all three modalities. Empirically, SLAVA outperforms two-stage baselines, achieving substantial gains in audio-based visual retrieval (e.g., recall@10 ≈ $0.52$) and demonstrating strong cross-modal transfer through modality-specific captions. The results underscore the value of unified training with rich modality-specific supervision for robust video understanding and multimodal reasoning.

Abstract

This paper proposes a single-stage training approach that semantically aligns three modalities (audio, visual, and text) using a contrastive learning framework. Contrastive training has gained prominence for multimodal alignment, utilizing large-scale unlabeled data to learn shared representations. The existing deep learning approach for trimodal alignment involves two stages that separately align the visual-text and audio-text modalities. This approach suffers from mismatched data distributions, resulting in suboptimal alignment. Leveraging the AVCaps dataset, which provides audio, visual, and audio-visual captions for video clips, our method jointly optimizes the representations of all the modalities using contrastive training. Our results demonstrate that the single-stage approach outperforms the two-stage method, achieving a two-fold improvement in audio-based visual retrieval and highlighting the advantages of unified multimodal representation learning.
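
The retrieval gains are quoted as recall@k in the TL;DR. As a rough illustration of that metric only, the sketch below assumes a square similarity matrix in which query i (for example, an audio clip) has exactly one correct candidate, at index i (its paired visual clip). The function name `recall_at_k` and this one-to-one pairing are assumptions made for illustration; the paper's actual evaluation protocol is not described in this excerpt.

```python
def recall_at_k(similarity, k):
    """Fraction of queries whose paired candidate ranks in the top k.

    similarity[i][j] is the score between query i and candidate j.
    Assumption: the correct candidate for query i is candidate i.
    """
    hits = 0
    for i, row in enumerate(similarity):
        # Candidate indices sorted by descending similarity to query i.
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return hits / len(similarity)


# Toy scores for three audio queries against three visual candidates.
toy_scores = [
    [0.9, 0.1, 0.0],
    [0.8, 0.2, 0.1],
    [0.1, 0.3, 0.7],
]
print(recall_at_k(toy_scores, 1))
print(recall_at_k(toy_scores, 2))
```

In the toy matrix, the second query ranks the wrong candidate first, so it counts as a miss at k = 1 but as a hit at k = 2.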

Paper Structure

This paper contains 18 sections, 3 equations, 1 figure, 2 tables.

Figures (1)

  • Figure 1: Schematic representation of the two-stage reference method (a) and the proposed single-stage method (b). Our proposed method aligns the representations from the three modalities by jointly optimizing the visual-text ($L_{\text{vt}}$), audio-text ($L_{\text{at}}$), and audio-visual ($L_{\text{av}}$) contrastive losses.
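
The joint objective named in the caption can be sketched as a sum of three pairwise contrastive losses. This is a minimal, dependency-free illustration, not the authors' implementation. The symmetric InfoNCE form, the temperature of 0.07, the equal weighting of the three terms, and the use of separate text embeddings for the visual and audio captions are all assumptions; the excerpt names only $L_{\text{vt}}$, $L_{\text{at}}$, and $L_{\text{av}}$.

```python
import math


def l2_normalize(vector):
    """Scale a vector to unit length (zero vectors are left unchanged)."""
    norm = math.sqrt(sum(x * x for x in vector)) or 1.0
    return [x / norm for x in vector]


def info_nce(queries, keys, temperature=0.07):
    """Mean cross-entropy of matching queries[i] to keys[i] among all keys."""
    q = [l2_normalize(v) for v in queries]
    k = [l2_normalize(v) for v in keys]
    total = 0.0
    for i, q_i in enumerate(q):
        # Cosine similarities between query i and every key, scaled by temperature.
        logits = [sum(a * b for a, b in zip(q_i, k_j)) / temperature for k_j in k]
        # Numerically stable log-sum-exp for the softmax denominator.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(z - peak) for z in logits))
        total += log_denominator - logits[i]
    return total / len(q)


def symmetric_contrastive(x, y, temperature=0.07):
    """Average of the x-to-y and y-to-x InfoNCE losses (assumed form)."""
    return 0.5 * (info_nce(x, y, temperature) + info_nce(y, x, temperature))


def trimodal_loss(audio, visual, audio_text, visual_text, temperature=0.07):
    """Unweighted sum L_vt + L_at + L_av over one batch (assumed weighting)."""
    l_vt = symmetric_contrastive(visual, visual_text, temperature)
    l_at = symmetric_contrastive(audio, audio_text, temperature)
    l_av = symmetric_contrastive(audio, visual, temperature)
    return l_vt + l_at + l_av


# Toy batch of two clips with 2-D embeddings that are already aligned.
aligned = [[1.0, 0.0], [0.0, 1.0]]
swapped = [[0.0, 1.0], [1.0, 0.0]]
print(trimodal_loss(aligned, aligned, aligned, aligned))
print(trimodal_loss(aligned, aligned, swapped, swapped))
```

When every modality maps a clip to the same direction, the summed loss is near zero; swapping the text embeddings so that captions point at the wrong clips raises the two text terms sharply while the audio-visual term stays low.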