Learning Visual Affordance from Audio

Lidong Lu, Guo Chen, Zhu Wei, Yicheng Liu, Tong Lu

TL;DR

This work defines the Audio-Visual Affordance Grounding (AV-AG) task: locating fine-grained, interactable regions in images based on action sounds. It introduces the AVAGD dataset, which provides pixel-level masks for function and dependency regions across diverse domains and includes an unseen subset for zero-shot evaluation. The paper also presents AVAGFormer, an end-to-end model with a semantic-conditioned cross-modal mixer and a dual-head decoder that fuses audio and visual information to predict both region types, achieving state-of-the-art results and demonstrating strong generalization. Together, AV-AG, AVAGD, and AVAGFormer establish a new multimodal benchmark for fine-grained affordance reasoning with audio guidance, with potential impact on embodied AI and interactive systems.

Abstract

We introduce Audio-Visual Affordance Grounding (AV-AG), a new task that segments object interaction regions from action sounds. Unlike existing approaches that rely on textual instructions or demonstration videos, which are often limited by ambiguity or occlusion, audio provides real-time, semantically rich, and visually independent cues for affordance grounding, enabling a more intuitive understanding of interaction regions. To support this task, we construct the first AV-AG dataset, comprising a large collection of action sounds, object images, and pixel-level affordance annotations. The dataset also includes an unseen subset to evaluate zero-shot generalization. Furthermore, we propose AVAGFormer, a model equipped with a semantic-conditioned cross-modal mixer and a dual-head decoder that effectively fuses audio and visual signals for mask prediction. Experiments show that AVAGFormer achieves state-of-the-art performance on AV-AG, surpassing baselines from related tasks. Comprehensive analyses highlight the distinctions between AV-AG and AVS, the benefits of end-to-end modeling, and the contribution of each component. Code and dataset are available at https://jscslld.github.io/AVAGFormer/.

Paper Structure

This paper contains 35 sections, 7 equations, 10 figures, 12 tables.

Figures (10)

  • Figure 1: Comparison with text-driven luo2022learning and demo-video-driven fang2018demo2vec affordance grounding. Audio-driven grounding helps build an intuitive perception of interaction regions through sound.
  • Figure 2: The semi-automatic data annotation pipeline used in the AVAGD dataset
  • Figure 3: Properties of the AVAGD dataset: (a) Distribution of categories, including 7 domains, 55 affordance categories, and 97 object categories. (b) Word cloud visualization of the affordances in the AVAGD dataset. (c) Annotation samples from the AVAGD dataset. (d) The number of images and audio clips per object category in the AVAGD dataset.
  • Figure 4: Overview of our proposed AVAGFormer. It consists of three key components: visual and audio feature extraction and integration, an audio-visual mixer, and an affordance decoder.
  • Figure 5: Architecture of the dual-head affordance decoder.
  • ...and 5 more figures