Learning Visual Affordance from Audio

Lidong Lu; Guo Chen; Zhu Wei; Yicheng Liu; Tong Lu

TL;DR

This work defines the Audio-Visual Affordance Grounding (AV-AG) task: locating fine-grained, interactable regions in images based on action sounds. It introduces the AVAGD dataset, which provides pixel-level masks for function and dependency regions across diverse domains and includes an unseen subset for zero-shot evaluation. The paper also presents AVAGFormer, an end-to-end model whose semantic-conditioned cross-modal mixer and dual-head decoder fuse audio and visual information to predict both region types, achieving state-of-the-art results and demonstrating strong generalization. Together, AV-AG, AVAGD, and AVAGFormer establish a new multimodal benchmark for fine-grained affordance reasoning with audio guidance, with potential impact on embodied AI and interactive systems.

Abstract

We introduce Audio-Visual Affordance Grounding (AV-AG), a new task that segments object interaction regions from action sounds. Unlike existing approaches that rely on textual instructions or demonstration videos, which are often limited by ambiguity or occlusion, audio provides real-time, semantically rich, and visually independent cues for affordance grounding, enabling a more intuitive understanding of interaction regions. To support this task, we construct the first AV-AG dataset, comprising a large collection of action sounds, object images, and pixel-level affordance annotations. The dataset also includes an unseen subset to evaluate zero-shot generalization. Furthermore, we propose AVAGFormer, a model equipped with a semantic-conditioned cross-modal mixer and a dual-head decoder that effectively fuses audio and visual signals for mask prediction. Experiments show that AVAGFormer achieves state-of-the-art performance on AV-AG, surpassing baselines from related tasks. Comprehensive analyses highlight the distinctions between AV-AG and AVS, the benefits of end-to-end modeling, and the contribution of each component. Code and dataset are available at https://jscslld.github.io/AVAGFormer/.
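
To make the idea of audio-guided mask prediction with two output heads concrete, the following is a purely illustrative toy in NumPy. It is not AVAGFormer's architecture: the function name, the tensor shapes, the sigmoid gate used as a stand-in for cross-modal fusion, and the two linear heads are all assumptions introduced here for illustration.

```python
import numpy as np

def toy_audio_conditioned_masks(visual, audio, w_func, w_dep):
    """Toy sketch (hypothetical, not the paper's model).

    visual: (H, W, D) image feature map
    audio:  (D,) embedding of the action sound
    w_func, w_dep: (D,) weights of two hypothetical linear heads
    Returns two boolean (H, W) masks: function and dependency regions.
    """
    h, w, d = visual.shape
    tokens = visual.reshape(h * w, d)

    # Relevance of each visual token to the audio cue (scaled dot product).
    scores = tokens @ audio / np.sqrt(d)

    # Sigmoid gate in (0, 1): an assumed, simplified form of fusion.
    gate = 1.0 / (1.0 + np.exp(-scores))
    fused = tokens * gate[:, None]

    # Two heads share the fused features but predict different region types.
    func_logits = (fused @ w_func).reshape(h, w)
    dep_logits = (fused @ w_dep).reshape(h, w)

    # Threshold logits at zero to obtain binary masks.
    return func_logits > 0, dep_logits > 0


rng = np.random.default_rng(0)
func_mask, dep_mask = toy_audio_conditioned_masks(
    rng.normal(size=(4, 5, 8)),  # random stand-in for image features
    rng.normal(size=8),          # random stand-in for an audio embedding
    rng.normal(size=8),
    rng.normal(size=8),
)
print(func_mask.shape, dep_mask.shape)
```

The point of the sketch is only the data flow: one audio embedding conditions every spatial location, and two separate heads read the same fused features to produce one mask per region type.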
