
Unified Video-Language Pre-training with Synchronized Audio

Shentong Mo, Haofan Wang, Huaxia Li, Xu Tang

TL;DR

The paper addresses the challenge of video-language pre-training by explicitly incorporating synchronized audio to learn tri-modal representations. It introduces VLSA, a unified transformer that jointly processes video patches, text tokens, and audio spectrograms, employing Local-Patch Masked Modeling and Global Audio Matching to capture both local interactions and global cross-modal alignment. Trained on only 0.9M video–audio–text triplets, VLSA achieves state-of-the-art or competitive results on text–video, text–audio, and video–audio retrieval benchmarks, demonstrating strong data efficiency and the value of audio synchronization. The work suggests that a single, weight-sharing encoder with targeted masked modeling and audio-guided global matching can yield compact, discriminative cross-modal embeddings with practical retrieval impact.

Abstract

Video-language pre-training is a typical and challenging problem that aims at learning visual and textual representations from large-scale data in a self-supervised way. Existing pre-training approaches either capture the correspondence of image-text pairs or utilize the temporal ordering of frames. However, they do not explicitly explore the natural synchronization between audio and the other two modalities. In this work, we propose an enhanced framework for Video-Language pre-training with Synchronized Audio, termed VLSA, that can learn tri-modal representations in a unified self-supervised transformer. Specifically, VLSA jointly aggregates embeddings of local patches and global tokens for video, text, and audio. Furthermore, we utilize local-patch masked modeling to learn modality-aware features, and leverage global audio matching to capture audio-guided features for video and text. We conduct extensive experiments on retrieval across text, video, and audio. Our simple model, pre-trained on only 0.9M data, achieves improved results against state-of-the-art baselines. In addition, qualitative visualizations showcase the superiority of VLSA in learning discriminative visual-textual representations.

Paper Structure (17 sections, 8 equations, 5 figures, 7 tables)

This paper contains 17 sections, 8 equations, 5 figures, 7 tables.

Figures (5)

  • Figure 1: Performance comparison on text-to-video (Left), zero-shot text-to-video (Middle), and text-to-audio (Right) retrieval. VLSA pre-trained on only 0.9M data achieves significant gains compared to previous video-language (MEE, HT, MMT, SupportSet, Frozen, OA-Trans, AllinOne), video-audio (TVLT), and video-language-audio (CE, VATT) methods.
  • Figure 2: Illustration of our enhanced framework of Video-Language pre-training with Synchronized Audio (VLSA). The modality-aware patch embeddings $\{\mathbf{x}_i^v\}_{i=1}^{VI}$, $\{\mathbf{x}_i^a\}_{i=1}^A$, $\{\mathbf{x}_i^t\}_{i=1}^S$ are extracted from each linear projection layer. The Local-Patch Masked Modeling module is applied to local-patch representations for audio spectrogram $\mathbf{a}$ extracted from the unified encoder, and the decoder is utilized to predict the raw audio spectrograms $\hat{\mathbf{a}}$ for learning the interaction of audio and the other two modalities (video and text). Finally, the Global Audio Matching module (contrastive loss and binary matching loss) is leveraged on modality-aware global embeddings $\hat{\mathbf{g}}^v, \hat{\mathbf{g}}^t, \hat{\mathbf{g}}^a$ averaged from the encoder to explicitly capture the cross-modal alignment between synchronized audio and the video frames and caption sentence.
  • Figure 3: Effect of modality types in a single joint encoder. A, V, and T denote audio, video, and text, respectively.
  • Figure 4: Effect of modality types with parameter-shared decoder. A, V, and T denote audio, video, and text, respectively.
  • Figure 5: Qualitative comparisons of visual-textual representations learned by VTM and GAM for matching (Top Row) and non-matching pairs (Bottom Row). Note that each spot denotes the visual/textual feature of one video/caption, and each color refers to one modality (yellow for video, green for text). The representations learned by VLSA are more discriminative.
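
The Figure 2 caption describes Global Audio Matching as a contrastive loss plus a binary matching loss over global embeddings averaged from the encoder. The following is a minimal sketch of the contrastive part only, not the authors' implementation. It assumes mean pooling over tokens, a symmetric InfoNCE objective, and a temperature of 0.07; the function names (`mean_pool`, `info_nce`, `global_audio_matching_loss`) are hypothetical, and the binary matching loss is omitted.

```python
import math


def mean_pool(tokens):
    """Average a list of token vectors into one global embedding."""
    dim = len(tokens[0])
    return [sum(t[d] for t in tokens) / len(tokens) for d in range(dim)]


def l2_normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def info_nce(anchors, targets, temperature=0.07):
    """Symmetric InfoNCE: row i of `anchors` is the positive for row i of `targets`.

    The temperature value is an assumption, not taken from the paper.
    """
    a = [l2_normalize(v) for v in anchors]
    b = [l2_normalize(v) for v in targets]
    n = len(a)
    # Cosine-similarity logits between every anchor and every target.
    logits = [
        [sum(x * y for x, y in zip(a[i], b[j])) / temperature for j in range(n)]
        for i in range(n)
    ]

    def cross_entropy_diag(matrix):
        # Cross-entropy where the correct class for row i is column i.
        total = 0.0
        for i, row in enumerate(matrix):
            mx = max(row)
            log_sum_exp = mx + math.log(sum(math.exp(x - mx) for x in row))
            total += log_sum_exp - row[i]
        return total / len(matrix)

    transposed = [list(col) for col in zip(*logits)]
    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(transposed))


def global_audio_matching_loss(audio_tokens, video_tokens, text_tokens,
                               temperature=0.07):
    """Audio-anchored contrastive loss against video and text global embeddings.

    Each argument is a batch: a list of samples, each a list of token vectors.
    """
    g_a = [mean_pool(t) for t in audio_tokens]
    g_v = [mean_pool(t) for t in video_tokens]
    g_t = [mean_pool(t) for t in text_tokens]
    return info_nce(g_a, g_v, temperature) + info_nce(g_a, g_t, temperature)


# Toy batch of two samples, two tokens each, with 2-D embeddings.
audio = [[[1.0, 0.0], [1.0, 0.0]], [[0.0, 1.0], [0.0, 1.0]]]
video = [[[1.0, 0.0], [1.0, 0.0]], [[0.0, 1.0], [0.0, 1.0]]]
text = [[[1.0, 0.0], [1.0, 0.0]], [[0.0, 1.0], [0.0, 1.0]]]

aligned_loss = global_audio_matching_loss(audio, video, text)
# Reversing the video batch breaks the audio-video synchronization.
misaligned_loss = global_audio_matching_loss(audio, video[::-1], text)
print(aligned_loss < misaligned_loss)
```

In this toy batch, the loss is lower when audio, video, and text come from the same sample than when the video batch is reordered, which is the behavior an audio-guided alignment objective is meant to encourage.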