Self-Explainable Affordance Learning with Embodied Caption
Zhipeng Zhang, Zhimin Wei, Guolei Sun, Peng Wang, Luc Van Gool
TL;DR
The paper tackles action ambiguity in visual affordance learning by proposing Self-Explainable Affordance learning with embodied captions (SEA), a framework that jointly localizes affordance heatmaps and generates embodied object-action captions. It introduces the SEA dataset (built on AGD20K) with exocentric and egocentric images, embodied captions, and corresponding heatmaps, and a model that fuses visual priors from DINO-ViT and CLIP through a Pixel-level Fusion Former and cross-domain attention. The approach uses a CLIP-based text encoder to guide heatmap localization via text-visual similarity, and employs cosine and contrastive losses to align modalities, enabling robust, self-explanatory predictions such as "I will beat the drums" or "I am going to carry the drum". Empirical results show improved visual grounding and caption accuracy, supporting more interpretable and controllable robot behavior in open-world settings and enabling timely human feedback for error rectification.
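To make the text-visual similarity and the two alignment losses concrete, here is a minimal NumPy sketch. It is not the paper's implementation: the function names, tensor shapes, min-max normalization of the heatmap, the symmetric InfoNCE form of the contrastive loss, and the temperature of 0.07 are all assumptions chosen for illustration, and random arrays stand in for real DINO-ViT/CLIP features.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale vectors along `axis` to (approximately) unit length."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def text_guided_heatmap(pixel_feats, text_emb):
    """Cosine similarity between each pixel feature and one text embedding.

    pixel_feats: (H, W, D) per-pixel visual features (assumed shape).
    text_emb:    (D,) caption embedding (assumed shape).
    Returns an (H, W) heatmap rescaled to [0, 1] by min-max normalization.
    """
    sim = l2_normalize(pixel_feats) @ l2_normalize(text_emb)  # (H, W)
    sim = sim - sim.min()
    return sim / (sim.max() + 1e-8)

def cosine_loss(a, b):
    """1 minus the mean cosine similarity of paired rows; a and b are (N, D)."""
    cos = np.sum(l2_normalize(a) * l2_normalize(b), axis=-1)
    return float(1.0 - cos.mean())

def contrastive_loss(visual, text, temperature=0.07):
    """Symmetric InfoNCE over a batch; matched pairs lie on the diagonal."""
    logits = l2_normalize(visual) @ l2_normalize(text).T / temperature  # (N, N)

    def cross_entropy_diag(l):
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_probs))

    return float(0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(logits.T)))

# Toy usage with random stand-in features.
rng = np.random.default_rng(0)
feats = rng.normal(size=(4, 5, 8))
caption = rng.normal(size=8)
feats[2, 3] = 3.0 * caption  # plant one pixel that matches the caption direction
heat = text_guided_heatmap(feats, caption)
peak = np.unravel_index(np.argmax(heat), heat.shape)  # location of the strongest response
```

In this toy setup the heatmap peaks at the planted pixel, which is the behavior the text-guided localization relies on: regions whose features align with the caption embedding receive the highest response.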
Abstract
In visual affordance learning, previous methods have mainly used abundant images or videos depicting human behavior patterns to identify action possibility regions for object manipulation, with a variety of applications in robotic tasks. However, they face a main challenge of action ambiguity, such as whether to beat or carry a drum, as well as the complexity of processing intricate scenes. Moreover, it is important that humans can intervene in time to rectify robot errors. To address these issues, we introduce Self-Explainable Affordance learning (SEA) with embodied caption. This enables robots to articulate their intentions and bridges the gap between explainable vision-language captioning and visual affordance learning. Because no appropriate dataset exists, we introduce a new dataset and metrics tailored for this task, integrating images, heatmaps, and embodied captions. Furthermore, we propose a novel model that effectively combines affordance grounding with self-explanation in a simple but efficient manner. Extensive quantitative and qualitative experiments demonstrate our method's effectiveness.
