CLIP-Powered TASS: Target-Aware Single-Stream Network for Audio-Visual Question Answering

Yuanyuan Jiang, Jianqin Yin

TL;DR

The paper tackles fine-grained AVQA by introducing the CLIP-powered TASS-Net, a single-stream architecture that exploits CLIP's image-text knowledge through a target-aware spatial grounding module (TSG+) and a unified joint audio-visual temporal grounding module (JTG). A cross-modal synchrony (CMS) loss based on the Jensen-Shannon ($JS$) divergence extends image-text matching to audio-text matching, enabling question-guided, temporally coherent fusion of audio-visual information. Key contributions include the TSG+ mechanism for region-level grounding without ground-truth labels, the CMS-driven JTG that combines audio-visual fusion and temporal grounding in one stream, and a preprocessing scheme that preserves video content while reducing compute. On MUSIC-AVQA, the method achieves state-of-the-art accuracy over all question types (around $74.98\%$), shows strong gains on counting and audio-visual questions, and provides qualitative evidence of sound-target grounding and temporally synchronized attention, suggesting wide applicability to audio-visual reasoning tasks.

Abstract

While vision-language pretrained models (VLMs) excel in various multimodal understanding tasks, their potential in fine-grained audio-visual reasoning, particularly for audio-visual question answering (AVQA), remains largely unexplored. AVQA presents specific challenges for VLMs because it requires visual understanding at the region level and seamless integration with the audio modality. Previous VLM-based AVQA methods merely used CLIP as a feature encoder, underutilizing its knowledge, and, like most AVQA methods, mistreated audio and video as separate entities in a dual-stream framework. This paper proposes a new CLIP-powered target-aware single-stream (TASS) network for AVQA that uses the image-text matching knowledge of the pretrained model through the natural audio-visual matching characteristic. It consists of two key components: the target-aware spatial grounding module (TSG+) and the single-stream joint temporal grounding module (JTG). Specifically, we propose a TSG+ module to transfer the image-text matching knowledge from CLIP models to our region-text matching process without corresponding ground-truth labels. Moreover, unlike previous dual-stream networks that still require an additional audio-visual fusion module, JTG unifies audio-visual fusion and question-aware temporal grounding in a simplified single-stream architecture. It treats audio and video as a cohesive entity and further extends the pretrained image-text knowledge to audio-text matching by preserving their temporal correlation with our proposed cross-modal synchrony (CMS) loss. Extensive experiments conducted on the MUSIC-AVQA benchmark verify the effectiveness of our proposed method over existing state-of-the-art methods.
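
To make the idea of a divergence-based synchrony loss concrete, the following is a minimal sketch and not the authors' implementation. It assumes the loss compares two question-aware temporal attention distributions, one over audio segments and one over visual segments, using the Jensen-Shannon divergence. The relevance scores, the number of segments, and all function names are hypothetical; the exact CMS formulation is not given in this summary.

```python
import math

def softmax(scores):
    """Turn raw relevance scores into a probability distribution."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for discrete distributions; eps guards against log(0)."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def js_divergence(p, q):
    """Jensen-Shannon divergence: symmetric and bounded above by ln 2."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

# Hypothetical question-to-segment relevance scores over 4 time segments.
audio_scores = [0.2, 1.5, 0.3, 0.1]
visual_scores = [0.1, 1.2, 0.9, 0.0]

audio_attn = softmax(audio_scores)
visual_attn = softmax(visual_scores)

# A small value means the two modalities attend to the same moments.
cms_like_loss = js_divergence(audio_attn, visual_attn)
```

Minimizing such a term would push the audio and visual attention to peak on the same segments, which is one plausible reading of "preserving their temporal correlation".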

Paper Structure (17 sections, 10 equations, 6 figures, 7 tables)

Figures (6)

  • Figure 1: An illustration of AVQA. Our proposed TASS-Net leverages the prior image-text matching knowledge from the pretrained model and transfers it to the AVQA model for spatio-temporal reasoning. The question is centered around the "instruments" (i.e., the target) and is broken down into "how many", "did not sound", and "from beginning to end" in terms of visual space, audio, and temporality, respectively. Answering this may take a human viewer considerable time, but an AI system with effective audio-visual scene parsing and spatio-temporal reasoning capabilities can do so promptly.
  • Figure 2: Comparison of different question-aware temporal grounding approaches. (a) The traditional approach usually adopts a dual-stream network that treats audio and video as separate entities. (b) Our proposed single-stream architecture treats audio and video as a whole and ensures the correlation between them, combining temporal grounding and fusion in one process.
  • Figure 3: The proposed target-aware single-stream network. We introduce text modality with explicit semantics into the audio-visual spatial grounding to associate specific sound-related visual features with the subject of interest, i.e., the target. We exploit the proposed cross-modal synchrony loss to incorporate audio-visual fusion and question-aware temporal grounding within a single-stream architecture. Finally, simple fusion is employed to integrate audio-visual and question information for predicting the answer.
  • Figure 4: The illustration of the Target-aware Spatial Grounding module (TSG+), which leverages the explicit semantics from textual modality to conduct visual spatial grounding.
  • Figure 5: Visualized target-aware spatial grounding results. Based on the grounding results of our method, the sounding areas of interest are highlighted spatially in each case (a-c), which indicates that our method can focus on the query subject, facilitating question-oriented scene understanding and reasoning.
  • ...and 1 more figure
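
The target-aware spatial grounding described in the Figure 3 and Figure 4 captions can be illustrated with a small sketch of region-text matching. This is an assumption-laden toy, not the TSG+ module itself: it scores each visual region against a text embedding of the target by cosine similarity, converts the scores to attention weights with a softmax, and pools the regions accordingly. The feature vectors, the `temperature` value, and the function names are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def ground_target(region_feats, target_feat, temperature=0.1):
    """Weight regions by similarity to the target text feature, then pool."""
    scaled = [cosine(r, target_feat) / temperature for r in region_feats]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(target_feat)
    pooled = [
        sum(w * r[d] for w, r in zip(weights, region_feats))
        for d in range(dim)
    ]
    return weights, pooled

# Hypothetical 3-dimensional features for three image regions.
regions = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.7, 0.7, 0.0]]
# Hypothetical text feature for the question's target (e.g. an instrument).
target = [1.0, 0.1, 0.0]

weights, pooled = ground_target(regions, target)
```

In this toy example the first region receives the largest weight because its feature is closest in direction to the target embedding. A real system would take both kinds of features from a pretrained image-text model so that they share one embedding space.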