Text-driven Affordance Learning from Egocentric Vision

Tomoya Yoshida; Shuhei Kurita; Taichi Nishimura; Shinsuke Mori

TL;DR

This work tackles visual affordance learning by enabling robots to ground textual instructions into actionable contact points and manipulation trajectories from egocentric vision. It introduces TextAFF80K, a large pseudo-labeled dataset built from Ego4D and Epic-Kitchens via a homography-based projection pipeline and detectors, to train models that predict heatmaps of contact points and parametric trajectories. By extending referring expression comprehension models (CLIPSeg and MDETR) to output both spatial and temporal affordances, the approach achieves robust performance across hand-object and tool-object interactions, with rotations included in trajectories for complex manipulations. The results highlight the value of textual input for grounding affordances and show that tool-object tasks benefit most from language-grounded models, while hand-object tasks benefit from strong object detectors and linear-trajectory modeling; future work will explore 3D environments and real-world robotic deployment.

Abstract

Visual affordance learning is a key component for robots to understand how to interact with objects. Conventional approaches in this field rely on pre-defined objects and actions, falling short of capturing diverse interactions in real-world scenarios. The key idea of our approach is to employ textual instructions, targeting various affordances for a wide range of objects. This approach covers both hand-object and tool-object interactions. We introduce text-driven affordance learning, which aims to learn contact points and manipulation trajectories from an egocentric view following a textual instruction. In our task, contact points are represented as heatmaps, and the manipulation trajectory is represented as a sequence of coordinates that incorporates both linear and rotational movements for various manipulations. However, manually annotating these diverse interactions is costly. To this end, we propose a pseudo dataset creation pipeline and build a large pseudo-training dataset, TextAFF80K, consisting of over 80K tuples of contact points, trajectories, images, and text. We extend existing referring expression comprehension models for our task, and experimental results show that our approach robustly handles multiple affordances, serving as a new standard for affordance learning in real-world scenarios.
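The heatmap representation of contact points can be illustrated with a minimal sketch. It assumes each contact point is rendered as an isotropic Gaussian and that the result is normalized by its peak; the abstract only states that contact points are heatmaps, so the kernel, the `sigma` value, and the function name `contact_heatmap` are illustrative assumptions rather than the authors' procedure.

```python
import numpy as np

def contact_heatmap(points, height, width, sigma=10.0):
    """Render (x, y) pixel contact points as a heatmap with values in [0, 1].

    Assumption: one isotropic Gaussian per point, summed and divided by the peak.
    """
    ys, xs = np.mgrid[0:height, 0:width]
    heatmap = np.zeros((height, width), dtype=np.float64)
    for x, y in points:
        heatmap += np.exp(-((xs - x) ** 2 + (ys - y) ** 2) / (2.0 * sigma ** 2))
    peak = heatmap.max()
    # Leave an all-zero map untouched when no contact points are given.
    return heatmap / peak if peak > 0 else heatmap

# Hypothetical contact points on a 64 x 96 image.
heatmap = contact_heatmap([(40, 30), (44, 32)], height=64, width=96)
```

A dense map of this kind lets nearby contact points merge into a single region, which is one plausible reason to prefer heatmaps over discrete keypoints.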

Paper Structure (15 sections, 2 equations, 5 figures, 4 tables)

Introduction
Related Work
Visual Affordance Learning
Egocentric Vision
Referring Expression Comprehension
Text-driven Affordance Learning
Pseudo-Label Creation
Network Architecture
Experiments
Experimental Settings
Results
Detailed analysis in Hand-Object Interactions
Ablation Study
Qualitative Analysis
Conclusion

Figures (5)

Figure 1: (a) VRB: the most closely related work. (b) Our task: Text-driven affordance learning. In our task, given an image and text, the model aims to predict the contact points and the manipulation trajectory for executing the textual instructions.
Figure 2: The flow of our pseudo-label creation. This approach consists of three components: (1) interaction classification, (2) projection of contact points, and (3) projection of trajectory. Given an egocentric video, we first classify the interaction type. Then we extract contact points and their trajectories from the frames $F_{inter}$ during the interaction. Finally, we project them into the frames $F_{obs}$ before the interaction.
Figure 3: Manual annotation tool for collecting test data.
Figure 4: Qualitative results. (a) and (b) depict the results of hand-object interaction, and (c) and (d) depict the results of tool-object interaction. The white dashed lines represent trajectories. We cropped images for a clearer display, but used the full images as input for the model.
Figure 5: Statistics of verb frequencies in the test set.
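The projection steps in Figure 2, which the TL;DR describes as homography-based, can be sketched as follows. This is a minimal illustration, assuming a 3x3 homography between an interaction frame in $F_{inter}$ and an observation frame in $F_{obs}$ has already been estimated; how it is estimated is not stated here. The function name `project_points` and the example matrix are hypothetical.

```python
import numpy as np

def project_points(H, points):
    """Map (N, 2) pixel coordinates through a 3x3 homography H."""
    pts = np.asarray(points, dtype=np.float64)
    # Lift to homogeneous coordinates: (x, y) -> (x, y, 1).
    homogeneous = np.hstack([pts, np.ones((len(pts), 1))])
    mapped = homogeneous @ np.asarray(H, dtype=np.float64).T
    # Divide by the third coordinate to return to pixel coordinates.
    return mapped[:, :2] / mapped[:, 2:3]

# Hypothetical homography: a pure translation by (+5, -3) pixels.
H_inter_to_obs = np.array([[1.0, 0.0, 5.0],
                           [0.0, 1.0, -3.0],
                           [0.0, 0.0, 1.0]])

# Hypothetical contact points detected in an interaction frame.
contact_points_inter = [(100.0, 80.0), (120.0, 90.0)]
contact_points_obs = project_points(H_inter_to_obs, contact_points_inter)
```

The same function could be applied to each waypoint of a trajectory, so that contact points and trajectories end up in the coordinate frame of the pre-interaction image.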
