GAIS: Frame-Level Gated Audio-Visual Integration with Semantic Variance-Scaled Perturbation for Text-Video Retrieval

Bowen Yang, Yun Cao, Chen He, Xiaosu Su

TL;DR

This work addresses text-to-video retrieval by introducing frame-level gated audio-visual fusion (FGF) guided by text and a semantic variance-scaled perturbation (SVSP) to regularize text embeddings. FGF enables fine-grained temporal grounding by computing per-frame gates and selectively weighting audio versus visual cues, while SVSP scales perturbations along semantic-variance dimensions to maintain semantic consistency and enable single-pass inference. The model is trained with a contrastive objective on perturbed text and cross-modal video embeddings, plus a weakly supervised refinement term to stabilize margins. Across MSR-VTT, DiDeMo, LSMDC, and VATEX, GAIS achieves state-of-the-art results with favorable efficiency, demonstrating improved cross-modal alignment and robust retrieval performance in audio-rich video scenarios.
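The per-frame gating idea can be illustrated with a minimal sketch. This is not the paper's implementation: the gate here is a hypothetical, parameter-free function of how much better the audio feature matches the query than the visual feature does, whereas the actual FGF module presumably uses learned parameters. The names `gated_fusion` and `temperature`, and the convention that the gate is the audio weight, are assumptions made for illustration.

```python
import math

def dot(x, y):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(x, y))

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def gated_fusion(visual, audio, text, temperature=1.0):
    """Fuse per-frame visual and audio features with a text-conditioned gate.

    visual, audio: lists of T frame vectors (each a list of D floats).
    text: one query vector of D floats.
    Returns (fused_frames, gates); gates[t] in (0, 1) is the audio weight.
    """
    fused, gates = [], []
    for v_t, a_t in zip(visual, audio):
        # Hypothetical gate (assumption): compare audio-text and visual-text similarity.
        g_t = sigmoid((dot(a_t, text) - dot(v_t, text)) / temperature)
        # Convex combination of the two modalities for this frame.
        fused.append([g_t * a + (1.0 - g_t) * v for a, v in zip(a_t, v_t)])
        gates.append(g_t)
    return fused, gates

# Toy input: frame 0 has query-aligned audio, frame 1 has off-query audio.
text = [1.0, 0.0]
visual = [[0.0, 1.0], [1.0, 0.0]]
audio = [[2.0, 0.0], [0.0, 2.0]]
fused, gates = gated_fusion(visual, audio, text)
```

In this toy input the gate favors audio on frame 0 and favors the visual feature on frame 1, which mirrors the qualitative behavior described for FGF: audio is weighted up where it matches the query and suppressed where it does not.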

Abstract

Text-to-video retrieval requires precise alignment between language and temporally rich audio-video signals. However, existing methods often emphasize visual cues while underutilizing audio semantics or relying on coarse fusion strategies, resulting in suboptimal multimodal representations. We introduce GAIS, a retrieval framework that strengthens multimodal alignment from both representation and regularization perspectives. First, a Frame-level Gated Fusion (FGF) module adaptively integrates audio-visual features under textual guidance, enabling fine-grained temporal selection of informative frames. Second, a Semantic Variance-Scaled Perturbation (SVSP) mechanism regularizes the text embedding space by controlling perturbation magnitude in a semantics-aware manner. These two modules are complementary: FGF minimizes modality gaps through selective fusion, while SVSP improves embedding stability and discrimination. Extensive experiments on MSR-VTT, DiDeMo, LSMDC, and VATEX demonstrate that GAIS consistently outperforms strong baselines across multiple retrieval metrics while maintaining notable computational efficiency.
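As a rough illustration of variance-scaled perturbation, the sketch below adds Gaussian noise to a text embedding with a per-dimension scale proportional to a standard deviation estimate, and returns the embedding unchanged at inference. How the paper estimates semantic variance is not stated in this summary; computing it across the query's token embeddings, and the names `svsp_perturb` and `alpha`, are assumptions made for illustration only.

```python
import random
import statistics

def svsp_perturb(sentence_emb, token_embs, alpha=0.1, training=True, rng=None):
    """Perturb a text embedding with noise scaled by per-dimension variance.

    sentence_emb: list of D floats (the pooled text embedding).
    token_embs: list of token vectors, each a list of D floats, used here
        as a stand-in source of per-dimension variance (assumption).
    alpha: hypothetical global noise strength.
    training: if False, return the embedding unchanged (deterministic pass).
    """
    if not training:
        return list(sentence_emb)
    rng = rng or random.Random()
    perturbed = []
    for d, value in enumerate(sentence_emb):
        # Population standard deviation of dimension d across tokens.
        sigma_d = statistics.pstdev([tok[d] for tok in token_embs])
        # Low-variance dimensions receive little or no noise.
        perturbed.append(value + alpha * sigma_d * rng.gauss(0.0, 1.0))
    return perturbed

# Toy input: dimension 0 is constant across tokens, dimension 1 varies.
tokens = [[0.0, 1.0], [0.0, 3.0], [0.0, 5.0]]
emb = [0.5, 3.0]
train_out = svsp_perturb(emb, tokens, rng=random.Random(0))
eval_out = svsp_perturb(emb, tokens, training=False)
```

Because the noise scale differs per dimension, the sampled neighborhood is ellipsoidal, not spherical, and the inference path needs only one forward pass since no sampling is performed.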

Paper Structure

This paper contains 19 sections, 9 equations, 10 figures, 12 tables.

Figures (10)

  • Figure 1: Illustration of Frame-level Gated Fusion. (a) When audio contains salient semantic cues, the gate assigns higher weights to audio across relevant frames. (b) When the audio is dominated by background noise, the gate suppresses audio contributions, preventing irrelevant signals from affecting retrieval.
  • Figure 2: Overview of GAIS. Given video frames, audio, and a text query, Frame-level Gated Fusion (FGF) adaptively integrates audio-visual features conditioned on text. The fused features are enhanced via text-video cross-attention and fed into the Semantic Variance-Scaled Perturbation (SVSP) module. Training uses stochastic perturbation for regularization, while inference employs a single deterministic pass for efficiency. $G$ represents the frame-level gated feature matrix (batch$\times$frames).
  • Figure 3: Text-guided frame-level audio–visual gating. FGF assigns high gate values to frames whose audio signals align with the query semantics, and suppresses uninformative or noisy segments, illustrating its fine-grained and interpretable fusion behavior.
  • Figure 4: Comparison of perturbation distributions in the text embedding space. Left: Fixed-magnitude stochastic perturbation produces a dispersed isotropic distribution. Right: SVSP scales perturbation by semantic variance, forming a compact ellipsoidal neighborhood around the original embedding (black ‘X’).
  • Figure 5: Per-audio-category retrieval performance (R@1) on MSR-VTT and DiDeMo. Audio categories were assigned by the YAMNet model [gemmeke2017audioset].
  • ...and 5 more figures