TACO: Training-free Sound Prompted Segmentation via Semantically Constrained Audio-visual CO-factorization

Hugo Malard, Michel Olvera, Stephane Lathuiliere, Slim Essid

TL;DR

This work tackles unsupervised sound-prompted segmentation by introducing TACO, a training-free framework that performs semantically constrained audio-visual co-factorization (Sem Co-NMF) on frozen CLIP and CLAP features to uncover co-activated concepts. Sem Co-NMF uses semantic anchors to align audio and visual representations in a shared semantic space, enabling local cross-modal correspondence without fine-tuning. The decomposition yields an interpretable dominant concept that seeds an open-vocabulary segmenter (FC-CLIP) for precise segmentation in a zero-shot setting, achieving state-of-the-art results on AVSBench, ADE SP, and AVSS while preserving model generalization. The approach offers a scalable, interpretable path for multi-source audio-visual localization and downstream open-vocabulary segmentation, with practical implications for zero-shot multimodal understanding.

Abstract

Large-scale pre-trained audio and image models demonstrate an unprecedented degree of generalization, making them suitable for a wide range of applications. Here, we tackle the specific task of sound-prompted segmentation, aiming to segment image regions corresponding to objects heard in an audio signal. Most existing approaches tackle this problem by fine-tuning pre-trained models or by training additional modules specifically for the task. We adopt a different strategy: we introduce a training-free approach that leverages Non-negative Matrix Factorization (NMF) to co-factorize audio and visual features from pre-trained models so as to reveal shared interpretable concepts. These concepts are passed on to an open-vocabulary segmentation model for precise segmentation maps. By using frozen pre-trained models, our method achieves high generalization and establishes state-of-the-art performance in unsupervised sound-prompted segmentation, significantly surpassing previous unsupervised methods.
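To make the factorization step concrete, here is a minimal sketch of plain NMF with the standard multiplicative updates, applied to a toy non-negative feature matrix. It is not the paper's Sem Co-NMF: the semantic anchor penalty that couples the audio and image decompositions is omitted, and the function name `nmf`, the matrix sizes, and the random toy data are assumptions made for illustration. The naming follows the figure captions, with `U` holding the (spatial or temporal) activations and `V` the factors.

```python
import numpy as np

def nmf(X, k, n_iter=200, eps=1e-9, seed=0):
    """Plain NMF, X ~= U @ V with U, V >= 0 (multiplicative updates).

    Illustrative only: the semantic constraint used by Sem Co-NMF
    is NOT implemented here.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    U = rng.random((n, k)) + eps  # activations: one row per patch/frame
    V = rng.random((k, d)) + eps  # factors: one row per concept
    for _ in range(n_iter):
        # Each update keeps entries non-negative and does not
        # increase the Frobenius reconstruction error.
        U *= (X @ V.T) / (U @ V @ V.T + eps)
        V *= (U.T @ X) / (U.T @ U @ V + eps)
    return U, V

# Toy stand-in for non-negative image features (assumed shape:
# 16 patches x 8 feature dimensions); real inputs would come from
# a frozen encoder.
toy_rng = np.random.default_rng(1)
X_img = toy_rng.random((16, 8))

U_img, V_img = nmf(X_img, k=3)
rel_err = np.linalg.norm(X_img - U_img @ V_img) / np.linalg.norm(X_img)

# Column k of U_img acts as a soft mask over patches for concept k.
dominant_patch_per_concept = U_img.argmax(axis=0)
```

In a co-factorization setting the same routine would be run on both modalities with an additional term tying their factors together; how that term is defined is specific to the paper and is not reproduced here.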

Paper Structure

This paper contains 29 sections, 7 equations, 11 figures, 8 tables, 1 algorithm.

Figures (11)

  • Figure 1: Our method takes a representation of an image and its associated audio as input, decomposing them into a product of 'semantic' factors and (spatial or temporal) activations. This decomposition enables locating parts of the original input corresponding to the concept present in both the image and the audio.
  • Figure 2: Complete pipeline: both the audio and the image are encoded using their respective encoder and their representations are used to perform the co-NMF. FC-CLIP is prompted using the image factors ($V_I$) and the segmentation corresponding to the sounding image factor ($V_I^{k\star}$) is kept as the final segmentation.
  • Figure 3: As CLIP and CLAP encode audio and images in different spaces, our method employs semantic anchors to project semantic components (representations soft-masked by $U_I^k$ and $U_A^k$) in an audio-visual semantic space where standard distances can be considered to compute the penalty function.
  • Figure 4: Qualitative segmentation examples from ADE SP and S4 datasets: TACO performs well even in challenging cases.
  • Figure 5: Multiple source segmentation examples. As the sources change during the video, TACO's segmentation changes as well.
  • ...and 6 more figures