Grounding 3D Scene Affordance From Egocentric Interactions

Cuiyu Liu, Wei Zhai, Yuhang Yang, Hongchen Luo, Sen Liang, Yang Cao, Zheng-Jun Zha

TL;DR

This work tackles grounding fine-grained 3D scene affordances from egocentric interactions, addressing limitations of passive semantic mappings and trial-and-error RL. It introduces Ego-SAG, a dual-module framework consisting of an Interaction-Guided Spatial Significance Allocation (ISA) module and a Bilateral Query Decoder (BQD) that jointly localize interaction-relevant sub-regions and align video- and 3D-scene affordance cues. To support this task, the authors present the Video-3D Scene Affordance Dataset (VSAD), a large-scale benchmark with 3,814 egocentric videos, 2,086 3D scenes, 17 affordance categories, and 7,690 ground-truth interactive regions. Experiments show Ego-SAG outperforms open-vocabulary and static-structure baselines across multiple metrics, demonstrating improved cross-modal grounding and paving the way for more proactive embodied agents in AR/VR and robotics.

Abstract

Grounding 3D scene affordance aims to locate interactive regions in 3D environments, which is crucial for embodied agents to interact intelligently with their surroundings. Most existing approaches achieve this by mapping semantics to 3D instances based on static geometric structure and visual appearance. This passive strategy limits the agent's ability to actively perceive and engage with the environment, making it reliant on predefined semantic instructions. In contrast, humans develop complex interaction skills by observing and imitating how others interact with their surroundings. To empower the model with such abilities, we introduce a novel task: grounding 3D scene affordance from egocentric interactions, where the goal is to identify the corresponding affordance regions in a 3D scene based on an egocentric video of an interaction. This task faces the challenges of spatial complexity and alignment complexity across multiple sources. To address these challenges, we propose the Egocentric Interaction-driven 3D Scene Affordance Grounding (Ego-SAG) framework, which utilizes interaction intent to guide the model in focusing on interaction-relevant sub-regions and aligns affordance features from different sources through a bidirectional query decoder mechanism. Furthermore, we introduce the Egocentric Video-3D Scene Affordance Dataset (VSAD), covering a wide range of common interaction types and diverse 3D environments to support this task. Extensive experiments on VSAD validate both the feasibility of the proposed task and the effectiveness of our approach.
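
The abstract describes aligning affordance features from the video and the 3D scene through a bidirectional query decoder, but gives no architectural detail here. As a rough illustration only, the following is a minimal sketch of one generic way two feature sets can query each other with cross-attention and yield a per-point affordance score. It is not the authors' implementation: the function names (`cross_attend`, `bidirectional_align`), the feature sizes, the mean pooling, and the sigmoid scoring head are all assumptions made for this example, and random arrays stand in for real encoder outputs.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attend(queries, context):
    # Scaled dot-product cross-attention: every query row becomes a
    # weighted average of the context rows.
    d = queries.shape[-1]
    weights = softmax(queries @ context.T / np.sqrt(d), axis=-1)
    return weights @ context

def bidirectional_align(video_tokens, point_feats):
    # Hypothetical alignment step (an assumption, not the paper's BQD):
    # each modality queries the other, with a residual connection.
    video_updated = video_tokens + cross_attend(video_tokens, point_feats)
    points_updated = point_feats + cross_attend(point_feats, video_tokens)
    # Pool the video tokens into one interaction vector and score every
    # point against it; the sigmoid maps scores into (0, 1).
    intent = video_updated.mean(axis=0)
    logits = points_updated @ intent / np.sqrt(intent.shape[0])
    return 1.0 / (1.0 + np.exp(-logits))

# Random stand-ins for encoder outputs: 8 video tokens and 100 scene
# points, both with 16-dimensional features.
rng = np.random.default_rng(0)
video_tokens = rng.normal(size=(8, 16))
point_feats = rng.normal(size=(100, 16))

scores = bidirectional_align(video_tokens, point_feats)
print(scores.shape)  # (100,)
```

In this sketch a high score for a point would be read as "likely part of the affordance region"; thresholding the scores would give a binary mask over the point cloud.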

Paper Structure (18 sections, 8 equations, 9 figures, 3 tables)

Figures (9)

  • Figure 1: Grounding 3D Scene Affordance from Egocentric Interactions. Given an egocentric video describing an interaction and a 3D scene, we propose to ground the corresponding affordance area in the 3D scene.
  • Figure 2: Method. Ego-SAG first uses modality-specific encoders to extract features, then models the relationship between interaction intent and scene sub-regions during the 3D U-Net decoding stage. Finally, it progressively unearths and aligns the corresponding affordance information in the interaction video and the 3D scene.
  • Figure 3: Properties of the VSAD Dataset. (a) Example data pairs from the VSAD dataset, with the video displayed on the left and the corresponding 3D scene visualization on the right. The red regions in the point cloud represent affordance annotations. (b) Distribution of video data: the horizontal axis represents the categories of interactive objects, the vertical axis shows the quantity of data, and different colors indicate different affordances. (c) The ratio of video data to 3D scene affordance regions for each object class. This shows that videos and 3D scenes are not confined to one-to-one pairings: a video can be associated with multiple scenes and vice versa.
  • Figure 4: Visualization Results. Each sample consists of three rows: the first row is the egocentric video demonstrating the interaction that the scene can afford, the second row shows the results of the comparison methods, and the third row shows the result of our method alongside the ground truth (GT). Scene affordance masks are colored red. Please zoom in for a better visualization.
  • Figure 5: t-SNE Visualization Results. The t-SNE results for (a) the baseline without any modules and (b) our model.
  • ...and 4 more figures