SAVE: Speech-Aware Video Representation Learning for Video-Text Retrieval

Ruixiang Zhao, Zhihao Xu, Bangxiang Lan, Zijie Xin, Jingyu Liu, Xirong Li

TL;DR

SAVE improves upon AVIGATE, a SOTA audiovisual method, with a dedicated speech branch for more effective speech embedding, and introduces soft-ALBEF for early vision-audio alignment that facilitates fusion.

Abstract

For video-text retrieval, the use of CLIP has been a de facto choice. Since CLIP provides only image and text encoders, this consensus has led to a biased paradigm that entirely ignores the sound track of videos. While several attempts have been made to reintroduce audio -- typically by incorporating an audio encoder and fusing its output with visual features -- these methods face two challenges: ineffective representation of speech content and suboptimal vision-audio fusion. To address these issues jointly, we propose SAVE, a Speech Aware Video rEpresentation learning method. SAVE improves upon AVIGATE, a SOTA audiovisual method, with a dedicated speech branch for more effective speech embedding. Furthermore, we introduce soft-ALBEF for early vision-audio alignment that facilitates fusion. Extensive experiments on five benchmarks show that SAVE compares favorably against the SOTA, outperforming AVIGATE by +4.1% on MSRVTT-9k, +1.9% on MSRVTT-7k, +2.5% on VATEX, +9.8% on Charades, and +2.1% on LSMDC in terms of the SumR metric.
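
The abstract reports gains in SumR without defining it. In video-text retrieval, SumR is commonly taken to be the sum of Recall@1, Recall@5 and Recall@10; the sketch below assumes that convention and also assumes that query i is paired with video i. The function names and the toy similarity matrix are illustrative and do not come from the paper.

```python
def recall_at_k(sim, k):
    """Percentage of queries whose ground-truth video ranks in the top k.

    sim[i][j] is the similarity between query i and video j.
    Assumption: the ground-truth video for query i is video i.
    """
    hits = 0
    for i, row in enumerate(sim):
        # Rank is 1 plus the number of videos scored strictly higher
        # than the ground-truth video.
        rank = 1 + sum(1 for score in row if score > row[i])
        if rank <= k:
            hits += 1
    return 100.0 * hits / len(sim)


def sum_r(sim, ks=(1, 5, 10)):
    """Sum of Recall@k over the given cut-offs (assumed SumR convention)."""
    return sum(recall_at_k(sim, k) for k in ks)


# Toy 3-query x 3-video similarity matrix (made-up numbers).
toy_sim = [
    [0.9, 0.1, 0.3],
    [0.2, 0.4, 0.8],
    [0.5, 0.6, 0.7],
]

if __name__ == "__main__":
    print("R@1 :", recall_at_k(toy_sim, 1))
    print("SumR:", sum_r(toy_sim))
```

Under this convention SumR ranges from 0 to 300, so it rewards a method for improving any of the three cut-offs.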

Paper Structure

This paper contains 16 sections, 3 equations, 8 figures, and 9 tables.

Figures (8)

  • Figure 1: An overview of this paper. (a) Problem: Current audio encoders (ResNet-18 and AST), trained on datasets of environmental sounds, are not well suited for speech embedding. (b) Solution: We improve the state-of-the-art in audio-enhanced video-text retrieval by introducing a dedicated speech branch for speech-aware video embedding. (c) Results: Our method consistently outperforms SOTA audio-enhanced models (TEFAL and AVIGATE) across five benchmarks.
  • Figure 2: Hard vs. soft labels for early vision-audio alignment. For the videos in the first two rows, the associated sound tracks are not semantically relevant to the video content, so enforcing vision-audio alignment for these videos is counterproductive. By contrast, the soft labels, estimated by ImageBind, provide finer supervision for better vision-audio alignment.
  • Figure 3: Proposed speech-aware video representation learning (SAVE) method for video-text retrieval. Given a short video $\mathcal{V}$ associated with a sound track $\mathcal{A}$, SAVE uses a tri-branch network to embed the video frames to a set of visual tokens $\{v_i\}$, the sound to a set of audio tokens $\{a_i\}$, and the speech to a set of textual tokens $\{s_i\}$. Gated fusion conditioned on the visual tokens is performed on the audio and textual tokens, yielding fused tokens $\{\hat{a}_i\}$ and $\{\hat{s}_i\}$, respectively. By aggregating $\{v_i\}$, $\{\hat{a}_i\}$ and $\{\hat{s}_i\}$, we obtain $\{\hat{v}_i\}$ as a speech-aware video representation. The video's relevance w.r.t. a specific query $\mathcal{T}$ is computed as a multi-grained similarity between $\{\hat{v}_i\}$ and the query embedding $t$. Soft-ALBEF is used only during training for better alignment between the visual and audio tokens.
  • Figure 4: Proportion of videos with audio / ASR available.
  • Figure 5: Comparison per group. Our SAVE consistently outperforms PIG (best visual model) and AVIGATE (best audiovisual model) across all groups, with the largest gain obtained in the Sound-Speech-related group.
  • ...and 3 more figures
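
The contrast between hard and soft labels in Figure 2 can be made concrete with a small sketch. It is not the paper's soft-ALBEF loss, whose exact form is not given in this excerpt. It assumes a one-directional (vision-to-audio) cross-entropy over in-batch similarities, with soft targets obtained by a softmax over similarities from a teacher such as ImageBind. All function names, temperatures and numbers are hypothetical.

```python
import math


def softmax(scores, temperature=1.0):
    """Numerically stable softmax over a list of scores."""
    scaled = [s / temperature for s in scores]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(target, predicted):
    """Cross-entropy between a target and a predicted distribution."""
    return -sum(t * math.log(p) for t, p in zip(target, predicted) if t > 0)


def hard_targets(batch_size):
    """One-hot labels: each video is paired only with its own sound track."""
    return [[1.0 if i == j else 0.0 for j in range(batch_size)]
            for i in range(batch_size)]


def soft_targets(teacher_sim, temperature=1.0):
    """Soft labels from teacher similarities (assumed to come from a
    pretrained model such as ImageBind)."""
    return [softmax(row, temperature) for row in teacher_sim]


def alignment_loss(student_sim, targets, temperature=0.07):
    """Mean vision-to-audio cross-entropy over the batch (assumed form)."""
    losses = [cross_entropy(target, softmax(row, temperature))
              for row, target in zip(student_sim, targets)]
    return sum(losses) / len(losses)


if __name__ == "__main__":
    # Two videos whose sound tracks the teacher finds equally (ir)relevant.
    teacher_sim = [[0.1, 0.1], [0.1, 0.1]]
    confident_student = [[1.0, 0.0], [0.0, 1.0]]
    print("hard:", alignment_loss(confident_student, hard_targets(2)))
    print("soft:", alignment_loss(confident_student, soft_targets(teacher_sim)))
```

In this toy setting the hard labels reward the student for confidently pairing each video with its own sound track, while the soft labels penalize that confidence because the teacher sees no preference between the two tracks.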