Learning Audio-guided Video Representation with Gated Attention for Video-Text Retrieval

Boseung Jeong, Jicheol Park, Sungyeon Kim, Suha Kwak

TL;DR

AVIGATE tackles video-text retrieval with two components: a gated fusion transformer that selectively leverages audio cues, and an adaptive margin-based contrastive loss. The model employs three encoders (AST for audio, CLIP for video frames, CLIP for text) and a multi-layer gating function that suppresses uninformative audio while exploiting informative cues, paired with a multi-grained alignment that combines global and local matching scores. The adaptive margin depends on intra-modal similarities, which yields a more discriminative cross-modal embedding space and improves generalization on MSR-VTT, VATEX, and Charades, all at a retrieval complexity of $O(A+V+T)$. These contributions yield state-of-the-art results and offer practical advantages in retrieval speed and robustness to noisy audio signals.

Abstract

Video-text retrieval, the task of retrieving videos based on a textual query or vice versa, is of paramount importance for video understanding and multimodal information retrieval. Recent methods in this area rely primarily on visual and textual features and often ignore audio, although it helps enhance overall comprehension of video content. Moreover, traditional models that incorporate audio blindly utilize the audio input regardless of whether it is useful or not, resulting in suboptimal video representation. To address these limitations, we propose a novel video-text retrieval framework, Audio-guided VIdeo representation learning with GATEd attention (AVIGATE), that effectively leverages audio cues through a gated attention mechanism that selectively filters out uninformative audio signals. In addition, we propose an adaptive margin-based contrastive loss to deal with the inherently unclear positive-negative relationship between video and text, which facilitates learning better video-text alignment. Our extensive experiments demonstrate that AVIGATE achieves state-of-the-art performance on all the public benchmarks.
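
The adaptive margin-based contrastive loss can be illustrated with a minimal NumPy sketch. The text here only states that the margin depends on intra-modal similarities; the specific form below (a margin that grows as the video-video and text-text similarities of a negative pair shrink, clipped at `max_margin`) and all names and hyperparameter values (`scale`, `max_margin`, `temperature`) are assumptions made for illustration, not necessarily the authors' formulation.

```python
import numpy as np

def l2_normalize(x, eps=1e-8):
    # Scale each row to unit length so dot products are cosine similarities.
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def log_softmax(x, axis):
    # Numerically stable log-softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def adaptive_margins(video_emb, text_emb, scale=0.2, max_margin=0.1):
    # ASSUMED form: the margin for negative pair (i, j) is larger when
    # video i / video j and text i / text j are less similar to each other.
    v, t = l2_normalize(video_emb), l2_normalize(text_emb)
    intra = 0.5 * (v @ v.T + t @ t.T)           # averaged intra-modal similarity
    margins = np.clip(scale * (1.0 - intra), 0.0, max_margin)
    np.fill_diagonal(margins, 0.0)              # no margin on positive pairs
    return margins

def adaptive_margin_contrastive_loss(video_emb, text_emb, temperature=0.05,
                                     scale=0.2, max_margin=0.1):
    # Symmetric InfoNCE-style loss; margins are added to negative-pair logits,
    # so negatives must be pushed further below the positives.
    v, t = l2_normalize(video_emb), l2_normalize(text_emb)
    sim = v @ t.T                               # (batch, batch) cross-modal similarity
    logits = (sim + adaptive_margins(video_emb, text_emb, scale, max_margin)) / temperature
    idx = np.arange(sim.shape[0])
    loss_v2t = -log_softmax(logits, axis=1)[idx, idx].mean()
    loss_t2v = -log_softmax(logits, axis=0)[idx, idx].mean()
    return 0.5 * (loss_v2t + loss_t2v)
```

Setting `scale=0.0` removes the margins and recovers a plain symmetric contrastive loss, which makes it easy to compare the two variants on the same batch.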

Paper Structure

This paper contains 23 sections, 9 equations, 5 figures, 11 tables.

Figures (5)

  • Figure 1: Comparative illustration of different scenarios using visual-only, audio-video fusion, and our proposed gated fusion approach. (a) In cases where the audio signal provides valuable information, the audio-video fusion and our gated fusion achieve high similarity scores. (b) When the audio signal is misleading, traditional fusion methods degrade performance. In contrast, our gated fusion mechanism successfully filters irrelevant audio cues, maintaining a high similarity score like the visual-only case.
  • Figure 2: (Left) The overall architecture of AVIGATE. Audio input is processed through an Audio Spectrogram Transformer (AST) and further refined by an audio resampler to generate fixed-size audio embeddings. Frame embeddings are derived from the video using a CLIP Image Encoder, while the text embedding is extracted by the CLIP Text Encoder. These audio and frame embeddings are fused by a gated fusion transformer, which dynamically determines the contribution of audio. The final video representation is aligned with the text embedding using a multi-grained alignment scheme, facilitating an effective video-text retrieval process. (Right) The gated fusion transformer consists of a gated fusion block and a gating function.
  • Figure 3: Top-1 text-to-video retrieval results of our method on MSR-VTT, all of which are true matches. $g_{mha}^{(l)}$ and $g_{ffn}^{(l)}$ denote the gating scores of the $l$-th layer of the gated fusion transformer. (a) The audio provides informative cues for accurate retrieval, since "a man is talking" in the query text is not visible. (b) The irrelevant audio is filtered by the gated fusion transformer, leading to an accurate retrieval result.
  • Figure 4: The overall architecture of the audio resampler.
  • Figure 5: Top-1 text-to-video retrieval results of our method on MSR-VTT, all of which are true matches. In (a) and (b), the audio provides informative cues for accurate retrieval: "a man is talking" in the query text is not visible (a), and "talk$\cdots$san diego" in the query text is not visible but audible (b). Neglecting these informative audio signals (i.e., w/o Audio) fails to retrieve the true matches. In (c) and (d), the irrelevant audio is filtered by the gated fusion transformer, leading to accurate retrieval results; without the gating mechanism (i.e., w/o Gate), the irrelevant audio causes false matches to be retrieved.
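
The gating behaviour described in the captions of Figures 2, 3, and 5 can be sketched as a single simplified fusion layer in NumPy. The captions only establish that each layer has two gating scores, $g_{mha}^{(l)}$ and $g_{ffn}^{(l)}$, produced by a gating function. Everything else here is an assumption for illustration: single-head cross-attention from frames to audio, mean pooling plus `tanh` as the gating function, the omission of layer normalization, and the hypothetical helper `init_params`.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def init_params(dim, hidden, rng, scale=0.1):
    # Hypothetical random initialisation, for demonstration only.
    shapes = {"Wq": (dim, dim), "Wk": (dim, dim), "Wv": (dim, dim),
              "W1": (dim, hidden), "W2": (hidden, dim),
              "w_mha": (2 * dim,), "w_ffn": (2 * dim,)}
    return {k: rng.normal(scale=scale, size=s) for k, s in shapes.items()}

def gating_function(frames, audio, p):
    # ASSUMED gating: pool both modalities, then map to two scalars in (-1, 1).
    pooled = np.concatenate([frames.mean(axis=0), audio.mean(axis=0)])
    return np.tanh(pooled @ p["w_mha"]), np.tanh(pooled @ p["w_ffn"])

def gated_fusion_layer(frames, audio, p):
    # frames: (num_frames, dim) frame embeddings; audio: (num_audio, dim).
    g_mha, g_ffn = gating_function(frames, audio, p)
    # Frames attend to audio (queries from frames, keys/values from audio).
    q, k, v = frames @ p["Wq"], audio @ p["Wk"], audio @ p["Wv"]
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1])) @ v
    x = frames + g_mha * attn                     # gated attention residual
    ffn = np.maximum(x @ p["W1"], 0.0) @ p["W2"]  # two-layer ReLU feed-forward
    return x + g_ffn * ffn, (g_mha, g_ffn)        # gated feed-forward residual
```

In this sketch, gates near zero reduce the layer to the identity on the frame embeddings, which corresponds to the visual-only behaviour that the captions describe for irrelevant audio; gates with larger magnitude let the audio-conditioned updates through.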