Can Textual Semantics Mitigate Sounding Object Segmentation Preference?

Yaoting Wang, Peiwen Sun, Yuanchao Li, Honggang Zhang, Di Hu

TL;DR

This work tackles the AVS bottleneck where audio alone provides weak semantic guidance, especially in multi-source scenes. It introduces TeSO, which uses dense scene descriptions and frozen LLMs to extract text cues via CoT reasoning, and a Semantics-Driven Audio Modeling (SeDAM) module to fuse audio with text cues through a crossmodal transformer. A Prompting Mask Queries with Semantics (PMQS) component injects semantic audio-text information into mask queries, while lightweight adapters refine decoding. Across AVS benchmarks, TeSO achieves competitive results and demonstrates enhanced sensitivity to audio changes when aided by text cues, highlighting the potential of textual semantics to strengthen audio-visual grounding and generalize to unseen objects.

Abstract

The Audio-Visual Segmentation (AVS) task aims to segment sounding objects in the visual space using audio cues. However, in this work, we observe that previous AVS methods rely heavily on detrimental segmentation preferences for audible objects rather than on precise audio guidance. We argue that the primary reason is that audio lacks robust semantics compared to vision, especially in multi-source sounding scenes, resulting in weak audio guidance over the visual space. Motivated by the fact that the text modality is well explored and contains rich abstract semantics, we propose leveraging text cues from the visual scene to enhance audio guidance with the semantics inherent in text. Our approach begins by obtaining scene descriptions through an off-the-shelf image captioner and prompting a frozen large language model to deduce potential sounding objects as text cues. Subsequently, we introduce a novel semantics-driven audio modeling module with a dynamic mask to integrate audio features with text cues, leading to representative sounding object features. These features not only encompass audio cues but also possess vivid semantics, providing clearer guidance in the visual space. Experimental results on AVS benchmarks validate that our method exhibits enhanced sensitivity to audio when aided by text cues, achieving highly competitive performance on all three subsets. Project page: https://github.com/GeWu-Lab/Sounding-Object-Segmentation-Preference
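
To make the idea of integrating audio features with text cues under a mask more concrete, here is a minimal sketch of masked cross-attention in plain Python. It is an illustration under stated assumptions, not the paper's SeDAM module: the function name `fuse_audio_with_text`, the boolean `keep_mask`, and the residual addition are all hypothetical, there are no learned projections, and how the actual dynamic mask is computed is not described in this summary.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_audio_with_text(audio_feat, text_cues, keep_mask):
    """Attend from one audio feature (query) to text-cue features (keys/values).

    audio_feat: list of floats with dimension d.
    text_cues:  list of cue vectors, each with dimension d.
    keep_mask:  list of booleans; False hides a cue from the attention
                (a stand-in for a dynamic mask, which is an assumption here).
    """
    d = len(audio_feat)
    kept = [cue for cue, keep in zip(text_cues, keep_mask) if keep]
    if not kept:
        # No usable text cue: fall back to the audio feature alone.
        return list(audio_feat)
    # Scaled dot-product scores between the audio query and each kept cue.
    scores = [
        sum(a * c for a, c in zip(audio_feat, cue)) / math.sqrt(d)
        for cue in kept
    ]
    weights = softmax(scores)
    # Weighted sum of the kept cue vectors.
    attended = [
        sum(w * cue[i] for w, cue in zip(weights, kept)) for i in range(d)
    ]
    # Residual connection keeps the original audio information.
    return [a + t for a, t in zip(audio_feat, attended)]


audio = [1.0, 0.0]
cues = [[1.0, 0.0], [0.0, 1.0]]
fused_one = fuse_audio_with_text(audio, cues, [True, False])
fused_none = fuse_audio_with_text(audio, cues, [False, False])
fused_both = fuse_audio_with_text(audio, cues, [True, True])
```

With only the first cue visible, the fused feature is the audio feature plus that cue; with every cue masked out, the audio feature passes through unchanged.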

Paper Structure (31 sections, 11 equations, 9 figures, 11 tables)


Figures (9)

  • Figure 1: Previous methods (left) achieve satisfactory results when given normal audio input. However, even when the audio is completely silent, they still segment a large number of pixels as the guitar, owing to the segmentation preference for audible objects built up during training. In contrast, our approach (right) uses a frozen LLM to reason from scene descriptions that the man is not playing the guitar, as his hands are off the strings. Guided by this semantic information, our method produces more precise segmentation with finer audio guidance.
  • Figure 2: Overall pipeline of the proposed TeSO method. We utilize an off-the-shelf image captioner for dense scene description and employ a frozen LLM to reason out potential sounding objects as text cues. These semantic text cues are then aggregated with audio features in the SeDAM module to form the sounding object features. Subsequently, we introduce the sounding object features into pre-trained mask queries in the PMQS module. Finally, we use adapters to tune the visual-only mask decoder for AVS in the Audio-prompted Decoding phase. “MSDA” is the multi-scale deformable attention proposed by Zhu et al. (2020).
  • Figure 3: "P.S.O" stands for "Potential Sounding Object". A frozen LLM reasoner extracts text cues by considering the interactions between audible objects. For instance, a person may sing while playing the guitar, but would not sing while playing a saxophone. In contrast, a noun parser simply captures every noun that is present.
  • Figure 4: Examples of the impact of normal audio input and all-mute audio on popular methods. In normal scenarios, our method produces better masks than previous methods. In all-mute scenarios, our approach exhibits strong sensitivity to the audio input, as it is capable of generating blank masks for silent audio clips.
  • Figure A1: Few-shot prompt with CoT instructions. We feed these prompts to the LLM each time before generating the reasoning results. This template yields the best result in the prompt-template comparison table (tab:prompt_template).
  • ...and 4 more figures
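
The all-mute probe described in the Figure 4 caption can be sketched as a simple check: run the same frame once with normal audio and once with silent audio, then compare how much of each predicted mask is foreground. The helper names `foreground_ratio` and `audio_sensitivity` and the toy masks below are hypothetical, chosen only for illustration; this is not the paper's evaluation protocol or metric.

```python
def foreground_ratio(mask):
    """Fraction of pixels marked as foreground in a binary mask (list of rows)."""
    total = sum(len(row) for row in mask)
    if total == 0:
        return 0.0
    foreground = sum(1 for row in mask for pixel in row if pixel)
    return foreground / total


def audio_sensitivity(mask_normal, mask_muted):
    """Drop in foreground area when the audio is replaced by silence.

    A value near the normal-audio foreground ratio means the mask went blank
    under silence; a value near 0 means the prediction ignored the audio.
    """
    return foreground_ratio(mask_normal) - foreground_ratio(mask_muted)


# Toy 2x2 masks: a model that blanks out under silence versus one that
# keeps segmenting the same region regardless of the audio.
normal = [[0, 1], [1, 1]]
muted_blank = [[0, 0], [0, 0]]
muted_same = [[0, 1], [1, 1]]

sensitive_score = audio_sensitivity(normal, muted_blank)
preference_score = audio_sensitivity(normal, muted_same)
```

Under this toy measure, a model driven by a segmentation preference scores close to zero, because its mask barely changes when the sound is removed.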