Grounding 3D Object Affordance with Language Instructions, Visual Observations and Interactions

He Zhu, Quyu Kong, Kechun Xu, Xunlong Xia, Bing Deng, Jieping Ye, Rong Xiong, Yue Wang

TL;DR

This work tackles grounding 3D object affordances by integrating language instructions, visual observations, and interactions. It introduces AGPIL, the first multi-modal, multi-view 3D affordance dataset with full-view, partial-view, and rotation-view data across seen/unseen splits, and proposes LMAffordance3D, a one-stage architecture that fuses 2D images, 3D point clouds, and language via a vision-language backbone to produce per-point affordance heatmaps. The method combines a ResNet-18 2D encoder, a PointNet++ 3D encoder, and a cross-attention decoder within a LLaVA-7B–based framework to ground affordances conditioned on language instructions. Experiments show that LMAffordance3D outperforms baselines, generalizes better to unseen objects and actions, and maintains robustness across view variations, indicating strong potential for robot manipulation tasks in real-world settings.

Abstract

Grounding 3D object affordance is the task of locating the regions of an object in 3D space where it can be manipulated, which links perception and action for embodied intelligence. For example, an intelligent robot must accurately ground the affordance of an object in order to grasp it according to human instructions. In this paper, we introduce a novel task, inspired by cognitive science, that grounds 3D object affordance based on language instructions, visual observations and interactions. We collect an Affordance Grounding dataset with Points, Images and Language instructions (AGPIL) to support the proposed task. In the 3D physical world, observation orientation, object rotation, or spatial occlusion often means that only a partial observation of the object is available, so the dataset includes affordance estimations of objects from full-view, partial-view, and rotation-view perspectives. To accomplish this task, we propose LMAffordance3D, the first multi-modal, language-guided 3D affordance grounding network, which applies a vision-language model to fuse 2D and 3D spatial features with semantic features. Comprehensive experiments on AGPIL demonstrate the effectiveness and superiority of our method on this task, even in unseen experimental settings. Our project is available at https://sites.google.com/view/lmaffordance3d.
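
To make the view terminology concrete, the following is a minimal sketch of how a rotation-view and a partial-view sample could be derived from a full-view point cloud with per-point affordance scores. It is an illustration only and not the procedure used to build AGPIL: the helper names `rotate_z` and `partial_view`, the rotation about the z-axis, and the half-space test used as a crude stand-in for occlusion are all assumptions.

```python
import math


def rotate_z(points, angle_rad):
    """Rotate (x, y, z) points about the z-axis (hypothetical rotation-view helper)."""
    c, s = math.cos(angle_rad), math.sin(angle_rad)
    return [(c * x - s * y, s * x + c * y, z) for x, y, z in points]


def partial_view(points, scores, view_dir=(1.0, 0.0, 0.0)):
    """Keep only points on the viewer-facing side of the plane through the centroid.

    This half-space cut is an assumed, crude proxy for self-occlusion. The
    per-point affordance scores are filtered together with the points so the
    two lists stay aligned.
    """
    n = len(points)
    centroid = tuple(sum(p[i] for p in points) / n for i in range(3))
    kept_points, kept_scores = [], []
    for point, score in zip(points, scores):
        offset = sum((point[i] - centroid[i]) * view_dir[i] for i in range(3))
        if offset >= 0.0:
            kept_points.append(point)
            kept_scores.append(score)
    return kept_points, kept_scores


# Toy full-view object: the 8 corners of a cube, with the affordance
# (score 1.0) placed on the +x face.
cube = [(x, y, z) for x in (-1.0, 1.0) for y in (-1.0, 1.0) for z in (-1.0, 1.0)]
scores = [1.0 if x > 0 else 0.0 for x, _, _ in cube]

rotated = rotate_z(cube, math.pi / 2)                # rotation-view sample
visible_pts, visible_scores = partial_view(cube, scores)  # partial-view sample
```

In this toy case the partial view keeps only the four corners facing the viewer, while the rotation view keeps every point but changes its coordinates, so the per-point scores stay valid in both cases.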

Paper Structure

This paper contains 18 sections, 1 equation, 6 figures, 4 tables.

Figures (6)

  • Figure 1: Illustration of the affordance grounding task. Inspired by cognitive science: when humans encounter a new object, they learn its affordance through language instructions, visual information from their eyes, and human-machine interactions.
  • Figure 2: Examples and statistics of the AGPIL dataset. Figures (a) and (b) are examples of images, point clouds, and a certain affordance that we randomly selected from the dataset. Figure (c) shows a word cloud generated according to the frequency of each word appearing in the language instructions. Figures (d) and (e) respectively show the distribution of affordances corresponding to different objects in image and point cloud data. The horizontal axis represents the types of objects, and the vertical axis represents the quantity. Different colors indicate different affordances. Figure (f) illustrates the distribution of image and point cloud data corresponding to each affordance. It indicates that images and point clouds are not a one-to-one match, as a single image may correspond to multiple objects.
  • Figure 3: Language instructions generation. In this example, the affordance is "cut" and the object category is "knife".
  • Figure 4: Method. The structure of the proposed LMAffordance3D model, which consists of four major components: 1) a vision encoder that processes multi-modal data, including images and point clouds, to encode and fuse the 2D and 3D features; 2) a vision-language model and its associated components (tokenizer and adapter), which take in the instruction tokens and the 2D and 3D vision tokens for fusion; 3) a decoder that uses the 2D and 3D spatial features as the query, the instructional features as the key, and the semantic features as the value to predict the affordance feature; 4) a head for segmenting and grounding 3D object affordance.
  • Figure 5: Visualization. We select several examples from the test set under different views and experimental settings, showcasing the model’s inputs and outputs, and comparing them with the ground truth (GT).
  • ...and 1 more figure
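
The decoder and head described in the Figure 4 caption can be sketched as a single-head scaled dot-product cross-attention followed by a per-point sigmoid. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the single attention head, the missing learned projections, the shared feature width `d`, the equal token count for key and value, and the random stand-in features are all hypothetical.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(query, key, value):
    """Single-head scaled dot-product cross-attention.

    query: N x d (spatial features, one row per point)
    key:   M x d (instructional features, one row per token)
    value: M x d (semantic features, one row per token)
    Returns an N x d matrix of per-point affordance features.
    """
    d = len(query[0])
    key_t = [list(col) for col in zip(*key)]
    scores = matmul(query, key_t)  # N x M similarity scores
    weights = [softmax([s / math.sqrt(d) for s in row]) for row in scores]
    return matmul(weights, value)


def affordance_head(features, weight, bias):
    """Assumed head: one linear unit plus sigmoid, giving a score in (0, 1) per point."""
    logits = [sum(f * w for f, w in zip(feat, weight)) + bias for feat in features]
    return [1.0 / (1.0 + math.exp(-z)) for z in logits]


def rand_matrix(rows, cols):
    """Random stand-in for encoder or vision-language model outputs."""
    return [[random.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


random.seed(0)
num_points, num_tokens, d = 6, 4, 8
spatial = rand_matrix(num_points, d)      # query
instruction = rand_matrix(num_tokens, d)  # key
semantic = rand_matrix(num_tokens, d)     # value

fused = cross_attention(spatial, instruction, semantic)
head_weight = [random.gauss(0.0, 1.0) for _ in range(d)]
heatmap = affordance_head(fused, head_weight, 0.0)  # one score per point
```

Because the points supply the query, the output has one row per point regardless of how many instruction tokens there are, which is what allows the head to emit a per-point affordance heatmap.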