Patch-level Sounding Object Tracking for Audio-Visual Question Answering

Zhangbin Li, Jinxing Zhou, Jing Zhang, Shengeng Tang, Kun Li, Dan Guo

TL;DR

This paper tackles audio-visual question answering by introducing PSOT, a patch-level sounding object tracking framework. It combines three graph-based key patch trackers (M-KPT, guided by visual motion; S-KPT, guided by audio-visual correspondence; and Q-KPT, which selects question-relevant patches), followed by Multimodal Message Aggregation to predict answers. On MUSIC-AVQA, PSOT achieves competitive accuracy with substantially less pretraining data and fewer parameters than large-scale pretrained models, illustrating strong data efficiency and the value of patch-level, graph-based reasoning. The approach performs robustly across varying motion levels and provides interpretable visualizations of how motion and sound cues guide attention to informative patches.

Abstract

Answering questions related to audio-visual scenes, i.e., the AVQA task, is becoming increasingly popular. A critical challenge is accurately identifying and tracking sounding objects related to the question along the timeline. In this paper, we present a new Patch-level Sounding Object Tracking (PSOT) method. It begins with a Motion-driven Key Patch Tracking (M-KPT) module, which relies on visual motion information to identify salient visual patches with significant movements that are more likely to relate to sounding objects and questions. We measure the patch-wise motion intensity map between neighboring video frames and utilize it to construct and guide a motion-driven graph network. Meanwhile, we design a Sound-driven KPT (S-KPT) module to explicitly track sounding patches. This module also involves a graph network, with the adjacency matrix regularized by the audio-visual correspondence map. The M-KPT and S-KPT modules are performed in parallel for each temporal segment, allowing balanced tracking of salient and sounding objects. Based on the tracked patches, we further propose a Question-driven KPT (Q-KPT) module to retain patches highly relevant to the question, ensuring the model focuses on the most informative clues. The audio-visual-question features are updated during the processing of these modules, which are then aggregated for final answer prediction. Extensive experiments on standard datasets demonstrate the effectiveness of our method, achieving competitive performance even compared to recent large-scale pretraining-based approaches.
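As a rough illustration of the "patch-wise motion intensity map between neighboring video frames" mentioned in the abstract, the sketch below scores each patch by its mean absolute pixel change between two frames. The abstract does not say how motion intensity is actually measured, so the frame-difference proxy, the function name `patch_motion_intensity`, and the toy frames are all assumptions made for illustration, not the authors' implementation.

```python
def patch_motion_intensity(prev_frame, next_frame, patch):
    """Mean absolute pixel difference per patch (an assumed proxy for motion).

    Frames are 2D lists of grayscale values with identical shapes.
    Rows/columns that do not fill a whole patch are ignored.
    """
    height, width = len(prev_frame), len(prev_frame[0])
    rows, cols = height // patch, width // patch
    intensity = [[0.0] * cols for _ in range(rows)]
    for y in range(rows * patch):
        for x in range(cols * patch):
            diff = abs(next_frame[y][x] - prev_frame[y][x])
            # Accumulate the per-pixel change into the patch that contains it.
            intensity[y // patch][x // patch] += diff / (patch * patch)
    return intensity


# Toy 4x4 grayscale frames: only the top-right 2x2 patch changes.
frame_a = [[0, 0, 0, 0] for _ in range(4)]
frame_b = [[0, 0, 5, 5], [0, 0, 5, 5], [0, 0, 0, 0], [0, 0, 0, 0]]
motion_map = patch_motion_intensity(frame_a, frame_b, patch=2)
```

In this toy case only the top-right patch receives a non-zero score, which is the kind of signal a motion-driven graph could use to emphasize moving patches. A practical version would more likely operate on tensors or learned patch features than on raw pixel lists.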

Paper Structure

This paper contains 17 sections, 10 equations, 5 figures, 10 tables.

Figures (5)

  • Figure 1: Illustration of the AVQA task and our main idea. (a) The AVQA task requires accurate comprehension of sounding objects related to the question. (b) Our method explores visual motion information and audio-visual correspondence to identify key salient and sounding patches. The question is then used to select the highly relevant patches.
  • Figure 2: Method Overview. 1) The M-KPT module tracks salient visual patches ($\star$) with large motion, which often relate to sounding objects and questions. It measures the patch-wise motion intensity between neighboring frames, yielding the motion-activation matrix ($\bm{m}_t$) that guides the adjacency matrix ($\mathcal{A}^{m}_{t}$) for learning the motion-driven graph network ($\mathcal{G}^m_t$). 2) Meanwhile, the S-KPT module tracks the sounding patches by assessing the audio-visual correspondence. A sound-driven graph network ($\mathcal{G}^s_t$) is constructed. 3) The Q-KPT module further processes visual patches highlighted by the M-KPT and S-KPT modules, retaining only those patches highly relevant to the question. This is also achieved in a graph ($\mathcal{G}^q_t$). 4) The MMA module integrates the question with the enhanced audio and visual features through independent graph networks for answer prediction.
  • Figure 3: Visualization of our model's inference process for AVQA. The key patches (red boxes) are highlighted in each module, guided by sound, motion, and question. Brighter colors indicate larger activation weights.
  • Figure 4: Ablation visualization on the roles of $\mathcal{A}_t^m$ and $\mathcal{A}_t^s$ in our model. The size of dots ($\bullet$) located in the center of visual patches represents the aggregated weight from the adjacency matrix during the graph network learning.
  • Figure 5: Ablation study on the number of graph layers.
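
The captions for Figures 2 and 4 describe adjacency matrices that are guided by a per-patch activation (motion or audio-visual correspondence) and whose aggregated weights are visualized per patch. The sketch below shows one way such guidance could work. It is a hypothetical construction, not the paper's formulation: the multiplicative reweighting of a softmax over dot-product similarity, the function names `guided_adjacency` and `aggregated_weight`, and the toy features are all assumptions.

```python
import math


def dot(u, v):
    """Dot product of two equal-length feature vectors."""
    return sum(a * b for a, b in zip(u, v))


def guided_adjacency(patch_feats, guidance):
    """Row-normalised adjacency: softmax similarity reweighted by guidance.

    `guidance[j]` is a non-negative activation for patch j (for example a
    motion or audio-visual correspondence score). The multiplicative form is
    an assumption; at least one guidance value must be positive.
    """
    adjacency = []
    for query in patch_feats:
        scores = [dot(query, key) for key in patch_feats]
        peak = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - peak) * g for s, g in zip(scores, guidance)]
        total = sum(weights)
        adjacency.append([w / total for w in weights])
    return adjacency


def aggregated_weight(adjacency):
    """Column sums: total edge weight each patch receives from all patches."""
    return [sum(row[j] for row in adjacency) for j in range(len(adjacency))]


# Three toy patches; the guidance marks patch 2 as the most active one.
patch_feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
guidance = [0.1, 0.1, 0.8]
adjacency = guided_adjacency(patch_feats, guidance)
received = aggregated_weight(adjacency)
```

Here each row of `adjacency` sums to one, and the column sums in `received` play the role of the dot sizes described for Figure 4: the patch with the strongest guidance collects the most weight from its neighbors.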