Multi-scale Multi-instance Visual Sound Localization and Segmentation

Shentong Mo, Haofan Wang

TL;DR

The paper addresses visual sound localization and segmentation in videos under weak supervision, where only video-level labels are available. It introduces M2VSL, a framework that aligns multi-scale visual features with audio via a novel Multi-scale Multi-instance Contrastive (MMC) objective and a Multi-scale Multi-instance Transformer (MMT). The approach achieves state-of-the-art results on VGGSound-Instruments, VGG-Sound Sources, and AVSBench for both localization and segmentation, without requiring pixel-level annotations. By learning discriminative cross-modal regions across scales, M2VSL advances multi-source audio-visual perception with potential applications in multimedia, surveillance, and assistive technologies.

Abstract

Visual sound localization is a typical and challenging problem that predicts the location of objects corresponding to the sound source in a video. Previous methods mainly used the audio-visual association between global audio and one-scale visual features to localize sounding objects in each image. Despite their promising performance, they omitted multi-scale visual features of the corresponding image, and they cannot learn discriminative regions compared to ground truths. To address this issue, we propose a novel multi-scale multi-instance visual sound localization framework, namely M2VSL, that can directly learn multi-scale semantic features associated with sound sources from the input image to localize sounding objects. Specifically, our M2VSL leverages learnable multi-scale visual features to align audio-visual representations at multi-level locations of the corresponding image. We also introduce a novel multi-scale multi-instance transformer to dynamically aggregate multi-scale cross-modal representations for visual sound localization. We conduct extensive experiments on VGGSound-Instruments, VGG-Sound Sources, and AVSBench benchmarks. The results demonstrate that the proposed M2VSL can achieve state-of-the-art performance on sounding object localization and segmentation.
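
To make the idea of aligning audio with multi-scale visual features more concrete, the following is a minimal sketch of one way such an objective could look. It is not the paper's MMC definition: the function names, the use of cosine similarity, max-pooling over locations as the multi-instance step, the audio-to-visual InfoNCE form, and the temperature of 0.07 are all assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-8)


def mil_score(audio, feature_map):
    """Multi-instance score (assumed): best-matching location in one feature map."""
    return max(cosine(audio, location) for location in feature_map)


def multiscale_mil_contrastive(audios, visuals, temperature=0.07):
    """Hypothetical multi-scale multi-instance contrastive loss.

    audios:  list of B audio embeddings (one global vector per clip).
    visuals: list of B items; each item is a list of S feature maps (one per
             scale), and each feature map is a list of location vectors.
    Returns the audio-to-visual InfoNCE loss averaged over clips and scales.
    """
    batch = len(audios)
    num_scales = len(visuals[0])
    total = 0.0
    for s in range(num_scales):
        for i in range(batch):
            # Score audio i against every image in the batch at scale s.
            logits = [mil_score(audios[i], visuals[j][s]) / temperature
                      for j in range(batch)]
            # Numerically stable log-sum-exp for the softmax denominator.
            peak = max(logits)
            log_denom = peak + math.log(sum(math.exp(x - peak) for x in logits))
            # The matching pair (i, i) is the positive.
            total += log_denom - logits[i]
    return total / (batch * num_scales)


# Toy data: 2 clips, 2 scales (a 2-location map and a coarser 1-location map).
audios = [[1.0, 0.0], [0.0, 1.0]]
visuals = [
    [[[1.0, 0.0], [1.0, 0.1]], [[1.0, 0.05]]],  # image paired with audio 0
    [[[0.0, 1.0], [0.1, 1.0]], [[0.05, 1.0]]],  # image paired with audio 1
]

aligned = multiscale_mil_contrastive(audios, visuals)
shuffled = multiscale_mil_contrastive(audios, [visuals[1], visuals[0]])
print(f"aligned loss:  {aligned:.6f}")
print(f"shuffled loss: {shuffled:.6f}")
```

In this toy setup the loss is low when each audio embedding is paired with its own image and high when the pairing is swapped, which is the behavior a contrastive alignment objective is meant to have. A symmetric (visual-to-audio) term and learned per-scale weights are common variations that are left out here for brevity.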

Paper Structure (14 sections, 6 equations, 4 figures, 4 tables)

This paper contains 14 sections, 6 equations, 4 figures, and 4 tables.

Figures (4)

  • Figure 1: Comparison of M2VSL with state-of-the-art methods on multi-source visual sound localization (Class-aware IoU@0.3).
  • Figure 2: Illustration of the proposed Multi-scale Multi-instance Visual Sound Localization (M2VSL) framework for weakly-supervised audio-visual localization and segmentation.
  • Figure 3: Qualitative comparisons with weakly-supervised semantic segmentation and visual sound source localization baselines. The proposed M2VSL generates more accurate, higher-quality segmentation maps for sounding objects.
  • Figure 4: Effect of batch size on weakly-supervised audio-visual segmentation (mIoU and F-score are reported).