Self-Explainable Affordance Learning with Embodied Caption

Zhipeng Zhang; Zhimin Wei; Guolei Sun; Peng Wang; Luc Van Gool

TL;DR

The paper tackles action ambiguity in visual affordance learning by proposing Self-Explainable Affordance learning with embodied captions (SEA), a framework that jointly localizes affordance heatmaps and generates embodied object-action captions. It introduces the SEA dataset (built on AGD20K), which provides exocentric and egocentric images, embodied captions, and corresponding heatmaps, together with a model that fuses visual priors from DINO-ViT and CLIP through a Pixel-level Fusion Former and cross-domain attention. A CLIP-based text encoder guides heatmap localization via text-visual similarity, and cosine and contrastive losses align the two modalities, so the model can state its intent in predictions such as 'I will beat the drums' or 'I am going to carry the drum'. The reported results show improved visual grounding and caption accuracy, which supports more interpretable and controllable robot behavior in open-world settings and lets humans give timely feedback to rectify errors.
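To make the idea of guiding heatmap localization by text-visual similarity more concrete, here is a minimal NumPy sketch. It is not the authors' implementation: the function names `similarity_heatmap` and `cosine_alignment_loss`, the feature shapes, and the min-max normalization are all assumptions chosen for illustration, and random arrays stand in for real CLIP or DINO-ViT features.

```python
import numpy as np

EPS = 1e-8  # guards against division by zero


def similarity_heatmap(patch_feats, text_feat, grid_hw):
    """Cosine similarity between each image patch and one caption embedding.

    patch_feats: (N, D) array of patch features (hypothetical encoder output).
    text_feat:   (D,) array, embedding of the embodied caption.
    grid_hw:     (H, W) patch grid with H * W == N.
    Returns an (H, W) map min-max scaled to [0, 1].
    """
    h, w = grid_hw
    p = patch_feats / (np.linalg.norm(patch_feats, axis=1, keepdims=True) + EPS)
    t = text_feat / (np.linalg.norm(text_feat) + EPS)
    sim = (p @ t).reshape(h, w)  # per-patch cosine similarity
    lo, hi = sim.min(), sim.max()
    return (sim - lo) / (hi - lo + EPS)


def cosine_alignment_loss(visual_feat, text_feat):
    """1 - cos(visual, text): near 0 when aligned, near 2 when opposed."""
    v = visual_feat / (np.linalg.norm(visual_feat) + EPS)
    t = text_feat / (np.linalg.norm(text_feat) + EPS)
    return 1.0 - float(v @ t)


# Toy usage: 16 random patch features on a 4x4 grid; the caption embedding
# is made identical to patch 5, so that patch should get the highest score.
rng = np.random.default_rng(0)
patches = rng.normal(size=(16, 8))
caption = patches[5].copy()
heatmap = similarity_heatmap(patches, caption, (4, 4))
print(np.unravel_index(np.argmax(heatmap), heatmap.shape))
```

In this toy setup the brightest cell is the patch whose feature matches the caption embedding, which is the intuition behind using a caption to disambiguate where (and for which action) to interact. The paper additionally mentions a contrastive loss, which is not sketched here.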

Abstract

In visual affordance learning, previous methods have mainly used abundant images or videos depicting human behavior patterns to identify action-possibility regions for object manipulation, with a variety of applications in robotic tasks. However, they face a central challenge of action ambiguity, for example whether to beat or carry a drum, as well as the complexity of processing intricate scenes. Moreover, humans need to be able to intervene in time to rectify robot errors. To address these issues, we introduce Self-Explainable Affordance learning (SEA) with embodied caption. It enables robots to articulate their intentions and bridges the gap between explainable vision-language captioning and visual affordance learning. Because no appropriate dataset exists, we present a new dataset and metrics tailored to this task, integrating images, heatmaps, and embodied captions. Furthermore, we propose a novel model that combines affordance grounding with self-explanation in a simple but efficient manner. Extensive quantitative and qualitative experiments demonstrate our method's effectiveness.

Paper Structure (17 sections, 13 equations, 11 figures, 3 tables)

This paper contains 17 sections, 13 equations, 11 figures, 3 tables.

Introduction
Related Work
Visual Affordance Learning
Embodied Vision-Language
Methodology
Self-Explainable Affordance Learning
SEA Dataset
SEA Model
Experiment
Experimental Setting
Evaluation Metric
Baselines and Comparisons
Ablation Study
Discussion and Visualization
Conclusion
...and 2 more sections

Figures (11)

Figure 1: Challenges in existing visual affordance learning (a-b) and our solutions (c-d). (a) Action ambiguity: Pick it up to drink? Pour it down? (b) Multi-object ambiguity: Each heat point corresponds to a distinct behavioral purpose. To tackle these issues, we introduce self-explainable affordance learning. In (c), the robot can say which action it is going to perform. In (d), the objects intended for interaction are distinguished by clear, well-defined descriptions.
Figure 2: Examples from the SEA Dataset. Our dataset provides meaningful action captions for both exocentric and egocentric images within the realm of visual affordance learning.
Figure 3: The statistics on the length of captions in our dataset: (a) represents the distribution of caption length in the seen scene, and (b) shows the result in the unseen scene.
Figure 4: The Word Cloud for the SEA Dataset, which displays various action and object categories in our dataset.
Figure 5: Overview of our proposed framework. It consists of three parts: (1) visual embedding from different domains, (2) the Self-Explainable Module and embodied caption, and (3) VL affordance prediction.
...and 6 more figures
