Audio Does Matter: Importance-Aware Multi-Granularity Fusion for Video Moment Retrieval

Junan Lin, Daizong Liu, Xianke Chen, Xiaoye Qu, Xun Yang, Jixiang Zhu, Sanyuan Zhang, Jianfeng Dong

TL;DR

This work tackles video moment retrieval by introducing adaptive audio-vision-text fusion. The core method, IMG, uses an Audio Importance Predictor to weigh audio contributions and a Multi-Granularity Fusion module to integrate audio and visual cues at local, event, and global scales, with cross-modal knowledge distillation to maintain performance when audio is absent. The approach achieves state-of-the-art results on Charades-STA and ActivityNet Captions and establishes strong performance on the newly released Charades-AudioMatter, demonstrating the practical value of selective audio integration. Overall, the framework enables robust, audio-aware VMR with improved generalization and resilience to noisy or missing audio.

Abstract

Video Moment Retrieval (VMR) aims to retrieve the specific moment in a video that is semantically related to a given query. Most existing VMR methods focus solely on the visual and textual modalities and neglect the complementary but important audio modality. Although a few recent works attempt joint audio-vision-text reasoning, they treat all modalities equally and simply embed them without fine-grained interaction for moment retrieval. These designs are impractical because not all audio is helpful for video moment retrieval: the audio of some videos may be pure noise or background sound that is irrelevant to locating the moment. To this end, we propose a novel Importance-aware Multi-Granularity fusion model (IMG), which learns to dynamically and selectively aggregate the audio-vision-text contexts for VMR. Specifically, after integrating the textual guidance with vision and audio separately, we first design a pseudo-label-supervised audio importance predictor that predicts the importance score of the audio and assigns weights accordingly to mitigate the interference caused by noisy audio. Then, we design a multi-granularity audio fusion module that adaptively fuses the audio and visual modalities at the local, event, and global levels, fully capturing their complementary contexts. We further propose a cross-modal knowledge distillation strategy to address the challenge of a missing audio modality during inference. To evaluate our method, we also construct a new VMR dataset, Charades-AudioMatter, in which audio-related samples are manually selected and re-organized from the original Charades-STA to validate the model's capability in utilizing the audio modality. Extensive experiments validate the effectiveness of our method, which achieves state-of-the-art performance among VMR methods that use audio-video fusion. Our code is available at https://github.com/HuiGuanLab/IMG.
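The idea of weighting audio by a predicted importance score can be illustrated with a small sketch. This is not the IMG implementation: the function names (`audio_importance`, `fuse`), the mean-pooling step, the single linear layer, and the additive fusion rule are all assumptions made for illustration, and the abstract does not specify any of them. The sketch only shows how a scalar score in (0, 1) could scale the audio contribution so that noisy audio is suppressed.

```python
import math


def sigmoid(x):
    """Squash a real-valued logit into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def audio_importance(audio_feats, weights, bias):
    """Hypothetical importance predictor (an assumption, not the paper's design).

    Mean-pools per-clip audio features over time, then applies one linear
    layer followed by a sigmoid to obtain a single score for the whole video.
    """
    dim = len(weights)
    pooled = [sum(clip[d] for clip in audio_feats) / len(audio_feats)
              for d in range(dim)]
    logit = sum(w * x for w, x in zip(weights, pooled)) + bias
    return sigmoid(logit)


def fuse(visual_feats, audio_feats, importance):
    """Assumed fusion rule: add the audio features, scaled by the score.

    With importance close to 0 the output falls back to the visual features,
    which is the behaviour one would want for irrelevant or noisy audio.
    """
    return [[v + importance * a for v, a in zip(v_clip, a_clip)]
            for v_clip, a_clip in zip(visual_feats, audio_feats)]


# Two clips with two-dimensional toy features.
visual = [[1.0, 2.0], [3.0, 4.0]]
audio = [[0.5, 0.5], [1.5, 1.5]]

score = audio_importance(audio, weights=[0.0, 0.0], bias=0.0)
fused = fuse(visual, audio, score)
```

In this toy setup the zero weights and zero bias give a logit of 0, so the score is 0.5 and each audio feature is added at half strength. In the paper the predictor is trained with pseudo-labels; how those labels are produced is not covered by this sketch.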

Paper Structure

This paper contains 33 sections, 10 equations, 11 figures, 17 tables.

Figures (11)

  • Figure 1: (Top) Audio is a critical modality, outweighing the importance of vision. (Bottom) Audio is entirely irrelevant and considered noise relative to the vision.
  • Figure 2: The framework of our proposed importance-aware multi-granularity fusion model for video moment retrieval.
  • Figure 3: Our proposed Multi-Granularity Fusion module: (a) Local-Level Fusion, (b) Event-Level Fusion, (c) Global-Level Fusion.
  • Figure 4: During inference, as noise in the audio progressively increases, the gap between the two curves in (a) widens, suggesting that the IMG model with AIP exhibits greater robustness. Additionally, as we expected, the average audio importance in (b) decreases as noise levels rise.
  • Figure 5: Performance of different granularity fusion strategies at different normalized moment-to-video ratios.
  • ...and 6 more figures