VideoAfford: Grounding 3D Affordance from Human-Object-Interaction Videos via Multimodal Large Language Model

Hanqing Wang; Mingyu Liu; Xiaoyu Chen; Chengwei MA; Yiming Zhong; Wenti Yin; Yuhao Liu; Zhiqing Cui; Jiahao Yuan; Lu Dai; Zhiyuan Ma; Hui Xiong

TL;DR

This work tackles grounding actionable regions on 3D objects from human–object interaction (HOI) videos by introducing VIDA, a large-scale video–point cloud dataset, and VideoAfford, a baseline that transfers HOI priors into 3D affordance grounding. The method combines a 3D vision backbone, a latent action encoder, and a video multimodal large language model with an affordance-conditioned decoder, and adds a spatial-aware loss to encourage coherent 3D segmentation. The reported results show gains over established baselines in both seen and unseen settings, along with open-world generalization. The approach targets more reliable, data-driven 3D affordance reasoning for robotic manipulation and downstream embodied AI tasks; the authors state that the datasets and code will be released.

Abstract

3D affordance grounding aims to highlight the actionable regions on 3D objects, which is crucial for robotic manipulation. Previous research has primarily focused on learning affordance knowledge from static cues such as language and images, which struggle to provide the dynamic interaction context that reveals temporal and causal cues. To address this limitation, we collect a comprehensive video-based 3D affordance dataset, VIDA, which contains 38K human-object-interaction (HOI) videos covering 16 affordance types, 38 object categories, and 22K point clouds. Based on VIDA, we propose a strong baseline, VideoAfford, which equips multimodal large language models with additional affordance segmentation capabilities, enabling both world-knowledge reasoning and fine-grained affordance grounding within a unified framework. To enhance action understanding, we leverage a latent action encoder to extract dynamic interaction priors from HOI videos. Moreover, we introduce a spatial-aware loss function that enables VideoAfford to acquire comprehensive 3D spatial knowledge. Extensive experimental evaluations demonstrate that our model significantly outperforms well-established methods and exhibits strong open-world generalization with affordance reasoning abilities. All datasets and code will be publicly released to advance research in this area.

Paper Structure (28 sections, 11 equations, 4 figures, 4 tables)

Introduction
Related Works
Affordance Learning.
Multimodal Large Language Model.
3D Spatial Reasoning.
Datasets
Collection Details
Videos and Point Clouds.
Statistic and Analysis
Methods
Architecture Overview
Network Architecture
Point Encoder.
Spatial Constraints.
Action Encoder.
...and 13 more sections

Figures (4)

Figure 1: Data Collection Pipeline. We show the full data collection and verification pipeline. First, we use VLMs to caption each video and extract keywords about actions and objects. We then use the VLMs to pair each video with an affordance type. Finally, we manually check the results to ensure correctness.
Figure 2: VIDA Dataset. Here we illustrate the detailed information of VIDA. a) shows examples of videos and the corresponding affordance point clouds, b) shows the ratios of videos and point clouds, and c) shows the category distributions of VIDA.
Figure 3: Overview of VideoAfford. Given an HOI video and a corresponding point cloud, VideoAfford uses LanguageBind as the video encoder and RenderNet as the action encoder to obtain video embeddings and latent action embeddings. Both are fed into the large language model, which predicts the language tokens and the affordance token. In parallel, VideoAfford uses a pre-trained 3D encoder to extract semantically rich point embeddings, which are passed to a geometry-guided upsampling and propagation module to obtain dense point features. Finally, the affordance token and the point features are fed into the affordance decoder to produce the affordance masks. More details about the propagation process are given in the appendix.
Figure 4: Visualization Results. The first column is the HOI videos, and the last column is the ground truth of 3D object affordance in the point cloud. The depth of red represents the affordance probability. Refer to our supplementary materials for more results.
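
The last step in the Figure 3 overview, where an affordance token and dense point features yield a per-point mask, can be illustrated with a small sketch. This page does not specify the decoder's internals, so the code below is only an assumed stand-in: it projects the token with a linear map and scores each point by a scaled dot product followed by a sigmoid. The function name `affordance_mask`, the projection `w_proj`, and all shapes are hypothetical and chosen for illustration.

```python
import numpy as np


def affordance_mask(aff_token, point_feats, w_proj):
    """Score every point against a projected affordance token.

    Assumed stand-in for the affordance decoder, not the paper's design.
    aff_token:   (d_llm,)      hidden state of the affordance token
    point_feats: (n, d_pt)     dense per-point features
    w_proj:      (d_llm, d_pt) hypothetical linear projection
    returns:     (n,)          per-point affordance probabilities
    """
    query = aff_token @ w_proj  # map the token into point-feature space
    # Scaled dot product between the query and each point feature.
    logits = point_feats @ query / np.sqrt(point_feats.shape[1])
    return 1.0 / (1.0 + np.exp(-logits))  # sigmoid


# Random placeholders stand in for real model outputs.
rng = np.random.default_rng(0)
aff_token = rng.normal(size=64)
point_feats = rng.normal(size=(2048, 32))
w_proj = rng.normal(size=(64, 32)) * 0.1

probs = affordance_mask(aff_token, point_feats, w_proj)
mask = probs > 0.5  # binarize at an arbitrary illustrative threshold
```

Keeping `probs` continuous matches the Figure 4 convention in which the depth of red encodes affordance probability; the 0.5 threshold is only a convenience for producing a binary mask.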
