VAGNet: Grounding 3D Affordance from Human-Object Interactions in Videos

Aihua Mao; Kaihang Huang; Yong-Jin Liu; Chee Seng Chan; Ying He

TL;DR

This paper proposes VAGNet, a framework that aligns video-derived interaction cues with 3D structure to resolve ambiguities that static cues cannot address, and introduces PVAD, the first HOI video-3D pairing affordance dataset, which provides functional supervision unavailable in prior work.

Abstract

3D object affordance grounding aims to identify regions on 3D objects that support human-object interaction (HOI), a capability essential to embodied visual reasoning. However, most existing approaches rely on static visual or textual cues, neglecting that affordances are inherently defined by dynamic actions. As a result, they often struggle to localize the true contact regions involved in real interactions. We take a different perspective. Humans learn how to use objects by observing and imitating actions, not just by examining shapes. Motivated by this intuition, we introduce video-guided 3D affordance grounding, which leverages dynamic interaction sequences to provide functional supervision. To achieve this, we propose VAGNet, a framework that aligns video-derived interaction cues with 3D structure to resolve ambiguities that static cues cannot address. To support this new setting, we introduce PVAD, the first HOI video-3D pairing affordance dataset, providing functional supervision unavailable in prior works. Extensive experiments on PVAD show that VAGNet achieves state-of-the-art performance, significantly outperforming static-cue baselines. The code and dataset will be made publicly available.

Paper Structure (14 sections, 8 equations, 7 figures, 2 tables)

Introduction
Related Work
Method
Overview
Point Cloud-Video Pairing (PVAD) Dataset
2D-3D Alignment with Contextual Attention
Spatial-Temporal Fusion Module
Affordance Decoding and Loss Function
Experiments
Setup
Comparison Results
Ablation Studies
Performance Analysis
Conclusion

Figures (7)

Figure 1: Motivation of Our Work. (a) Existing 3D affordance grounding methods rely mainly on static visual or textual cues, forcing the model to infer how an object might be used from its shape or static interaction context. Consequently, they struggle with perspective ambiguity, visually similar parts (e.g. blade vs. handle), and complex multi-contact interactions. (b) Human-object interaction videos, in contrast, reveal affordance directly through use: they show how hands approach, contact, and move across object surfaces. Our approach leverages this insight by contextually aligning video observations with 3D object geometry and mapping the extracted interaction cues into 3D space. This paradigm shift grounds affordance by observing actual use rather than inferring from appearance, providing richer functional supervision and enabling more reliable 3D affordance grounding.
Figure 2: VAGNet Architecture. Our model takes a point cloud, its 2D projection, and a corresponding interaction video as input. These are first processed by three modality-specific encoders to extract point features ($F_p$), image features ($F_i$), and video features ($F_v$). Then, the Multimodal Contextual Alignment Module (MCAM) aligns $F_i$ with $F_v$ to produce a joint 2D representation $F_{2d}$, which is then fused with $F_p$ through a cross-attention layer and a point decoder to obtain the context-aligned 3D feature $F_{3d}$. Subsequently, the Spatial-Temporal Fusion Module (STFM) integrates $F_v$ and $F_{3d}$ to produce the spatio-temporal feature $F_f$, which is finally fed into a decoder to generate the 3D affordance mask.
Figure 3: The PVAD Dataset. (a) Overview with annotated affordance regions on point clouds highlighted in red. (b) Distribution of video samples across different affordance categories. (c) Statistics of video and point cloud counts for representative object-affordance pairs.
Figure 4: Comparative visualization of affordance results generated by different methods on selected test instances from the Seen and Unseen settings. Each object is associated with its corresponding interaction video, represented by two sampled frames. The affordance probability for each 3D point is encoded as a heatmap (red means high probability). Ground-truth (GT) annotations are provided for reference.
Figure 5: Ablation of MCAM. (a) Contextual attention maps during the alignment between the projected image and the video instruction. (b) Visualization results with and without MCAM.
...and 2 more figures
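
The data flow described in the Figure 2 caption can be traced with a small NumPy sketch. It is only an illustration of how the feature shapes move through the pipeline, not the authors' implementation: the captions do not specify the internals of MCAM, STFM, the point decoder, or the final decoder, so plain single-head cross-attention with a residual connection stands in for each fusion step, a single linear layer with a sigmoid stands in for the decoder, and the encoder outputs are random arrays. The names `cross_attention`, `vagnet_dataflow`, and `w_dec`, as well as all sizes, are hypothetical.

```python
import numpy as np

def cross_attention(queries, context):
    """Single-head scaled dot-product attention: each query row attends over
    the context rows. The residual connection is an assumption."""
    d = queries.shape[-1]
    scores = queries @ context.T / np.sqrt(d)
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return queries + weights @ context

def vagnet_dataflow(F_p, F_i, F_v, w_dec):
    """Follows the order of operations in the Figure 2 caption.
    Every module body here is a stand-in, not the published design."""
    F_2d = cross_attention(F_i, F_v)    # stand-in for MCAM: align image with video
    F_3d = cross_attention(F_p, F_2d)   # fuse with point features (point decoder omitted)
    F_f = cross_attention(F_3d, F_v)    # stand-in for STFM: spatio-temporal fusion
    logits = F_f @ w_dec                # stand-in decoder: one linear layer
    return 1.0 / (1.0 + np.exp(-logits))  # per-point affordance probability

# Hypothetical sizes: N points, M image tokens, T video tokens, d channels.
rng = np.random.default_rng(0)
N, M, T, d = 2048, 196, 8, 64
F_p = rng.standard_normal((N, d))   # placeholder for point encoder output
F_i = rng.standard_normal((M, d))   # placeholder for image encoder output
F_v = rng.standard_normal((T, d))   # placeholder for video encoder output
w_dec = rng.standard_normal(d) / np.sqrt(d)

mask = vagnet_dataflow(F_p, F_i, F_v, w_dec)
print(mask.shape)  # (2048,)
```

The point features act as queries in the last two fusion steps, so the output keeps one value per 3D point, which matches the per-point affordance probability visualized as a heatmap in Figure 4.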
