GAIS: Frame-Level Gated Audio-Visual Integration with Semantic Variance-Scaled Perturbation for Text-Video Retrieval
Bowen Yang, Yun Cao, Chen He, Xiaosu Su
TL;DR
This work addresses text-to-video retrieval by introducing frame-level gated audio-visual fusion (FGF) guided by text and a semantic variance-scaled perturbation (SVSP) to regularize text embeddings. FGF enables fine-grained temporal grounding by computing per-frame gates and selectively weighting audio versus visual cues, while SVSP scales perturbations along semantic-variance dimensions to maintain semantic consistency and enable single-pass inference. The model is trained with a contrastive objective on perturbed text and cross-modal video embeddings, plus a weakly supervised refinement term to stabilize margins. Across MSR-VTT, DiDeMo, LSMDC, and VATEX, GAIS achieves state-of-the-art results with favorable efficiency, demonstrating improved cross-modal alignment and robust retrieval performance in audio-rich video scenarios.
Abstract
Text-to-video retrieval requires precise alignment between language and temporally rich audio-visual signals. However, existing methods often emphasize visual cues while underutilizing audio semantics or relying on coarse fusion strategies, resulting in suboptimal multimodal representations. We introduce GAIS, a retrieval framework that strengthens multimodal alignment from both representation and regularization perspectives. First, a Frame-level Gated Fusion (FGF) module adaptively integrates audio-visual features under textual guidance, enabling fine-grained temporal selection of informative frames. Second, a Semantic Variance-Scaled Perturbation (SVSP) mechanism regularizes the text embedding space by controlling perturbation magnitude in a semantics-aware manner. The two modules are complementary: FGF narrows the modality gap through selective fusion, while SVSP improves embedding stability and discrimination. Extensive experiments on MSR-VTT, DiDeMo, LSMDC, and VATEX demonstrate that GAIS consistently outperforms strong baselines across multiple retrieval metrics while remaining computationally efficient.
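
To make the two ideas concrete, the following is a minimal illustrative sketch and not the GAIS implementation. It assumes a hypothetical gate (a sigmoid of the difference between text-visual and text-audio similarity per frame) and a hypothetical perturbation rule (Gaussian noise scaled per dimension by the standard deviation of the token embeddings). The function names, the `scale` parameter, and both formulas are assumptions made for illustration; the abstract does not specify them.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def gated_fusion(visual, audio, text):
    """Fuse per-frame visual and audio features with a text-conditioned gate.

    visual, audio: lists of T frame vectors, each a list of D floats.
    text: one D-dimensional query vector.
    Assumed gate (hypothetical): sigmoid(text.visual - text.audio) per frame,
    so the modality closer to the text receives the larger weight.
    """
    fused, gates = [], []
    for v, a in zip(visual, audio):
        g = sigmoid(dot(text, v) - dot(text, a))
        gates.append(g)
        fused.append([g * vi + (1.0 - g) * ai for vi, ai in zip(v, a)])
    return fused, gates


def variance_scaled_perturbation(text, token_embeddings, scale=0.1, rng=None):
    """Perturb a text embedding with noise scaled by per-dimension variance.

    Assumed rule (hypothetical): noise_d = scale * std_d * N(0, 1), where
    std_d is the standard deviation of dimension d across token embeddings.
    Dimensions with no variance across tokens are left unchanged.
    """
    rng = rng or random.Random(0)
    n, dim = len(token_embeddings), len(text)
    means = [sum(tok[d] for tok in token_embeddings) / n for d in range(dim)]
    stds = [
        math.sqrt(sum((tok[d] - means[d]) ** 2 for tok in token_embeddings) / n)
        for d in range(dim)
    ]
    return [t + scale * s * rng.gauss(0.0, 1.0) for t, s in zip(text, stds)]


# Toy example: frame 0 matches the text visually, frame 1 matches it in audio.
text = [2.0, 0.0]
visual = [[1.0, 0.0], [0.0, 0.0]]
audio = [[0.0, 0.0], [1.0, 0.0]]
fused, gates = gated_fusion(visual, audio, text)

# Training-time perturbation; at inference the unperturbed embedding is used.
tokens = [[0.0, 1.0], [2.0, 1.0]]
perturbed = variance_scaled_perturbation(text, tokens)
```

In this toy setup the gate favors the visual stream for the first frame and the audio stream for the second, and the perturbation only moves the text embedding along the dimension in which the tokens actually vary. Because the perturbation is applied only during training in this sketch, inference needs a single forward pass over the unperturbed text embedding.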
