Task-Aware 3D Affordance Segmentation via 2D Guidance and Geometric Refinement

Lian He, Meng Liu, Qilang Ye, Yu Zhou, Xiang Deng, Gangyi Ding

TL;DR

This work tackles 3D scene-level affordance segmentation from natural language to enable embodied agents to interact with real environments. It introduces TASA, a geometry-optimized, coarse-to-fine framework that fuses 2D semantic cues and 3D geometric reasoning, featuring a task-aware 2D affordance detection stage and a 3D refinement stage. Key components include a CLIP-guided affordance-weighted frame selector, a manipulable-point validation module, and a Point Transformer-based 3D refinement with a multi-objective loss. On SceneFun3D, TASA achieves state-of-the-art accuracy (e.g., AP50 = 26.9, mIoU = 19.7) and notable efficiency gains (3.37× speedup, ~40% fewer FLOPs), underscoring the value of integrating 2D semantic priors with 3D geometry for robust, high-fidelity affordance segmentation in complex scenes.

Abstract

Understanding 3D scene-level affordances from natural language instructions is essential for enabling embodied agents to interact meaningfully in complex environments. However, this task remains challenging because it requires both semantic reasoning and spatial grounding. Existing methods mainly focus on object-level affordances or merely lift 2D predictions to 3D, neglecting the rich geometric structure of point clouds and incurring high computational costs. To address these limitations, we introduce Task-Aware 3D Scene-level Affordance segmentation (TASA), a novel geometry-optimized framework that jointly leverages 2D semantic cues and 3D geometric reasoning in a coarse-to-fine manner. To improve detection efficiency, TASA features a task-aware 2D affordance detection module that identifies manipulable points from language and visual inputs and guides the selection of task-relevant views. To fully exploit 3D geometric information, a 3D affordance refinement module integrates 2D semantic priors with local 3D geometry, producing accurate and spatially coherent 3D affordance masks. Experiments on SceneFun3D demonstrate that TASA significantly outperforms the baselines in both accuracy and efficiency on scene-level affordance segmentation.

Paper Structure

This paper contains 28 sections, 11 equations, 5 figures, 4 tables.

Figures (5)

  • Figure 1: Comparison among (a) object affordance understanding, (b) referring segmentation, and (c) scene affordance segmentation paradigms.
  • Figure 2: Overview of our Task-Aware 3D Scene-level Affordance segmentation framework (TASA).
  • Figure 3: Illustration of the Double-Check Mechanism.
  • Figure 4: Effect of the number of selected images $K$ on segmentation performance, evaluated using $\text{mIoU}$. Each experiment is conducted with a different Affordance Weight $\alpha_a$.
  • Figure 5: Qualitative comparison on SceneFun3D Split0. Point clouds are cropped around functional objects for improved visibility.
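
The roles of $K$ and $\alpha_a$ in Figure 4 can be illustrated with a minimal sketch of affordance-weighted top-K frame selection. This summary does not give the paper's actual scoring rule, so the convex combination below is an assumption, and the function name `select_top_k_frames` and its inputs `text_sim` and `affordance_score` (one score per frame, e.g. a CLIP text-image similarity and a 2D affordance confidence) are hypothetical.

```python
import numpy as np


def select_top_k_frames(text_sim, affordance_score, alpha_a=0.5, k=3):
    """Rank frames by a weighted mix of two per-frame scores and keep the top k.

    text_sim, affordance_score: 1-D sequences with one value per frame
        (hypothetical inputs, not taken from the paper).
    alpha_a: weight on the affordance score; the convex combination is an
        assumption made for illustration only.
    Returns (indices of the selected frames, combined score per frame).
    """
    text_sim = np.asarray(text_sim, dtype=float)
    affordance_score = np.asarray(affordance_score, dtype=float)
    if text_sim.ndim != 1 or text_sim.shape != affordance_score.shape:
        raise ValueError("expected two 1-D score arrays of equal length")
    if not 0.0 <= alpha_a <= 1.0:
        raise ValueError("alpha_a must lie in [0, 1]")

    # Assumed scoring rule: convex combination of the two per-frame scores.
    combined = (1.0 - alpha_a) * text_sim + alpha_a * affordance_score

    # Sort by descending score; a stable sort keeps the earlier frame on ties.
    order = np.argsort(-combined, kind="stable")
    k = max(0, min(k, combined.size))
    return order[:k].tolist(), combined


if __name__ == "__main__":
    sims = [0.9, 0.2, 0.5, 0.4]   # made-up text-image similarities
    affs = [0.1, 0.8, 0.6, 0.3]   # made-up affordance confidences
    for alpha in (0.0, 1.0):
        picked, _ = select_top_k_frames(sims, affs, alpha_a=alpha, k=2)
        print(alpha, picked)
```

Under this assumed rule, `alpha_a = 0` ranks frames by text similarity alone and `alpha_a = 1` by affordance confidence alone, while `k` bounds how many views are passed on to the 3D stage.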