Unlocking 3D Affordance Segmentation with 2D Semantic Knowledge

Yu Huang; Zelin Peng; Changsong Wen; Xiaokang Yang; Wei Shen

TL;DR

This work tackles the semantic ambiguity of 3D point clouds for affordance segmentation by transferring rich 2D semantic knowledge from Vision Foundation Models to 3D. It introduces Cross-Modal Affinity Transfer (CMAT) to align a 3D encoder with lifted 2D semantics using three objectives: geometric reconstruction, affinity alignment, and feature diversity, producing a structured 3D representation. Building on this backbone, the Cross-modal Affordance Segmentation Transformer (CAST) fuses 3D features with multi-modal prompts (text and images) to generate dense, prompt-aware segmentation maps, achieving state-of-the-art results on PIAD, PIADv2, and LASO. The approach provides a generalizable framework for injecting 2D semantic knowledge into 3D understanding, with strong implications for robotic manipulation and embodied AI.

Abstract

Affordance segmentation aims to parse 3D objects into functionally distinct parts, bridging recognition and interaction for applications in robotic manipulation, embodied AI, and AR. While recent studies leverage visual or textual prompts to guide this process, they often rely on point cloud encoders as generic feature extractors, overlooking the intrinsic challenges of 3D data such as sparsity, noise, and geometric ambiguity. As a result, 3D features learned in isolation frequently lack clear and semantically consistent functional boundaries. To address this bottleneck, we propose a semantic-grounded learning paradigm that transfers rich semantic knowledge from large-scale 2D Vision Foundation Models (VFMs) into the 3D domain. Specifically, we introduce Cross-Modal Affinity Transfer (CMAT), a pre-training strategy that aligns a 3D encoder with lifted 2D semantics and jointly optimizes reconstruction, affinity, and diversity objectives to yield semantically organized representations. Building on this backbone, we further design the Cross-modal Affordance Segmentation Transformer (CAST), which integrates multi-modal prompts with CMAT-pretrained features to generate precise, prompt-aware segmentation maps. Extensive experiments on standard benchmarks demonstrate that our framework establishes new state-of-the-art results for 3D affordance segmentation.
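The abstract names the three CMAT objectives (reconstruction, affinity, and diversity) but does not give their exact form. As a rough illustration only, the sketch below shows one way such a combined loss could be assembled in NumPy. Everything specific here is an assumption rather than the authors' formulation: the cosine-similarity affinity matrix, the mean-squared alignment term, the Chamfer distance as the reconstruction term, the channel-decorrelation penalty as the diversity term, the loss weights, and all function names.

```python
import numpy as np


def cosine_affinity(feats):
    """Pairwise cosine similarity between the rows of an (N, D) feature matrix."""
    normed = feats / (np.linalg.norm(feats, axis=1, keepdims=True) + 1e-8)
    return normed @ normed.T


def affinity_alignment_loss(feats_3d, feats_2d_lifted):
    """Mean squared gap between the 3D and lifted-2D affinity matrices.

    Assumption: both inputs describe the same N points, so both affinity
    matrices are (N, N) even if the two feature widths differ.
    """
    diff = cosine_affinity(feats_3d) - cosine_affinity(feats_2d_lifted)
    return float(np.mean(diff ** 2))


def chamfer_distance(pred, target):
    """Symmetric Chamfer distance between point sets of shape (N, 3) and (M, 3)."""
    sq_dist = np.sum((pred[:, None, :] - target[None, :, :]) ** 2, axis=-1)
    return float(sq_dist.min(axis=1).mean() + sq_dist.min(axis=0).mean())


def diversity_loss(feats):
    """Penalize correlated feature channels (assumed stand-in for 'diversity')."""
    centered = feats - feats.mean(axis=0, keepdims=True)
    z = centered / (centered.std(axis=0, keepdims=True) + 1e-8)
    corr = (z.T @ z) / feats.shape[0]
    off_diag = corr - np.diag(np.diag(corr))
    return float(np.mean(off_diag ** 2))


def cmat_style_loss(points, recon, feats_3d, feats_2d_lifted,
                    w_rec=1.0, w_aff=1.0, w_div=0.1):
    """Weighted sum of the three terms; the weights are hypothetical."""
    return (w_rec * chamfer_distance(recon, points)
            + w_aff * affinity_alignment_loss(feats_3d, feats_2d_lifted)
            + w_div * diversity_loss(feats_3d))
```

Aligning affinity matrices rather than raw features means, in this sketch, that the 3D encoder and the 2D foundation model do not need matching feature dimensions: only the pairwise relations between points are compared.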
