Unlocking 3D Affordance Segmentation with 2D Semantic Knowledge

Yu Huang, Zelin Peng, Changsong Wen, Xiaokang Yang, Wei Shen

TL;DR

This work tackles the semantic ambiguity of 3D point clouds for affordance segmentation by transferring rich 2D semantic knowledge from Vision Foundation Models to 3D. It introduces Cross-Modal Affinity Transfer (CMAT) to align a 3D encoder with lifted 2D semantics using three objectives: geometric reconstruction, affinity alignment, and feature diversity, producing a structured 3D representation. Building on this backbone, the Cross-modal Affordance Segmentation Transformer (CAST) fuses 3D features with multi-modal prompts (text and images) to generate dense, prompt-aware segmentation maps, achieving state-of-the-art results on PIAD, PIADv2, and LASO. The approach provides a generalizable framework for injecting 2D semantic knowledge into 3D understanding, with strong implications for robotic manipulation and embodied AI.

Abstract

Affordance segmentation aims to parse 3D objects into functionally distinct parts, bridging recognition and interaction for applications in robotic manipulation, embodied AI, and AR. While recent studies leverage visual or textual prompts to guide this process, they often rely on point cloud encoders as generic feature extractors, overlooking the intrinsic challenges of 3D data such as sparsity, noise, and geometric ambiguity. As a result, 3D features learned in isolation frequently lack clear and semantically consistent functional boundaries. To address this bottleneck, we propose a semantic-grounded learning paradigm that transfers rich semantic knowledge from large-scale 2D Vision Foundation Models (VFMs) into the 3D domain. Specifically, we introduce Cross-Modal Affinity Transfer (CMAT), a pre-training strategy that aligns a 3D encoder with lifted 2D semantics and jointly optimizes reconstruction, affinity, and diversity to yield semantically organized representations. Building on this backbone, we further design the Cross-modal Affordance Segmentation Transformer (CAST), which integrates multi-modal prompts with CMAT-pretrained features to generate precise, prompt-aware segmentation maps. Extensive experiments on standard benchmarks demonstrate that our framework establishes new state-of-the-art results for 3D affordance segmentation.
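
To make the three CMAT objectives concrete, the following is a minimal NumPy sketch of how a reconstruction term, an affinity-alignment term, and a diversity term might be combined into one loss. The exact formulations are not given in this summary, so every choice here is an assumption for illustration: Chamfer distance for reconstruction, mean squared error between cosine-affinity matrices for alignment, a channel-decorrelation penalty for diversity, and the hypothetical names `cmat_style_loss`, `w_rec`, `w_aff`, and `w_div`.

```python
import numpy as np

def cosine_affinity(feats):
    """Pairwise cosine similarity between the rows of an (N, D) feature matrix."""
    normed = feats / (np.linalg.norm(feats, axis=1, keepdims=True) + 1e-8)
    return normed @ normed.T

def affinity_alignment_loss(feats_3d, feats_2d):
    """Mean squared difference between the 3D and lifted-2D affinity matrices.

    Both matrices are (N, N), so the two feature spaces may have different widths.
    """
    diff = cosine_affinity(feats_3d) - cosine_affinity(feats_2d)
    return float(np.mean(diff ** 2))

def chamfer_distance(pred, target):
    """Symmetric Chamfer distance between point sets of shape (N, 3) and (M, 3)."""
    sq_dists = np.sum((pred[:, None, :] - target[None, :, :]) ** 2, axis=-1)
    return float(sq_dists.min(axis=1).mean() + sq_dists.min(axis=0).mean())

def diversity_loss(feats):
    """Assumed diversity term: penalize correlation between feature channels."""
    centered = feats - feats.mean(axis=0, keepdims=True)
    z = centered / (centered.std(axis=0, keepdims=True) + 1e-8)
    corr = (z.T @ z) / feats.shape[0]
    off_diag = corr - np.diag(np.diag(corr))
    return float(np.mean(off_diag ** 2))

def cmat_style_loss(pred_points, target_points, feats_3d, feats_2d,
                    w_rec=1.0, w_aff=1.0, w_div=0.1):
    """Weighted sum of the three terms; the weights are hypothetical."""
    return (w_rec * chamfer_distance(pred_points, target_points)
            + w_aff * affinity_alignment_loss(feats_3d, feats_2d)
            + w_div * diversity_loss(feats_3d))

# Toy usage with random data standing in for real encoder outputs.
rng = np.random.default_rng(0)
points = rng.normal(size=(64, 3))
reconstructed = points + 0.05 * rng.normal(size=(64, 3))
feats_3d = rng.normal(size=(64, 16))   # per-point 3D encoder features
feats_2d = rng.normal(size=(64, 32))   # 2D features lifted onto the same points
print(cmat_style_loss(reconstructed, points, feats_3d, feats_2d))
```

One property of matching affinities rather than raw features is that the 3D and 2D feature dimensions do not need to agree: only the pairwise similarity structure over the same set of points is compared.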

Paper Structure

This paper contains 24 sections, 6 equations, 5 figures, 3 tables.

Figures (5)

  • Figure 1: Qualitative comparison of 3D feature representations. We visualize learned features across different objects to highlight the semantic organization within 3D representations. The 2D Semantics column presents clustering results obtained by lifting features from multi-view renderings and encoding them with a 2D Vision Foundation Model (e.g., DINOv3), which reveal distinct functional regions such as handles and seats. The 3D-Only column shows features produced solely by a 3D encoder (e.g., PointNet++), projected into RGB space using PCA, where the boundaries between functional areas remain fuzzy and less coherent. The Ours column illustrates our distilled features, which transfer semantic knowledge from 2D models into the 3D domain. Compared to the 3D-only baseline, our method produces features that exhibit stronger semantic organization, clearer separation of functional parts, and more consistent region boundaries across object categories.
  • Figure 2: Comparison of Geometry-Only vs. Semantic-Grounded Methods for 3D Representation Learning. (a) A 3D-only encoder produces an unstructured feature space where functional parts are indistinguishable. (b) Our semantic-grounded method leverages knowledge from multi-view 2D images to produce a highly structured feature space, where functional parts like the handle are clearly separated, enabling an intuitive understanding of how to interact with the object.
  • Figure 3: An overview of our three-stage learning framework. In Stage 2 (Structured Representation Learning), we pre-train a 3D backbone ($\Phi_{3D}$) using our proposed Cross-Modal Affinity Transfer (CMAT) objective. The training is guided by a combination of three complementary losses: geometric reconstruction, affinity alignment, and feature diversity. In Stage 3 (Prompt-driven Task Adaptation), the resulting structurally aware backbone is fine-tuned for affordance segmentation. Our Cross-modal Affordance Segmentation Transformer (CAST) architecture ingests the pre-trained 3D features and fuses them with a multi-modal prompt (visual or textual) to produce the final segmentation.
  • Figure 4: The architecture of our Cross-modal Affordance Segmentation Transformer (CAST). CAST takes geometric patch tokens from our pre-trained 3D backbone and fuses them with multi-modal prompts (text and/or visual). First, all input features are projected into a shared embedding space, and learnable modality embeddings are added to preserve source identity. The geometric and prompt tokens are then concatenated and processed by a stack of co-attentional Transformer blocks. Within this module, self-attention enables deep, bidirectional fusion, allowing the geometric features to become conditioned on the prompt and the prompt to ground itself in the 3D shape. Finally, the updated, prompt-aware patch features are upsampled to the per-point level and processed by a lightweight MLP head to produce the final segmentation mask.
  • Figure 5: Qualitative comparison on challenging cases from the PIADv2 (visual prompt) and LASO (text prompt) datasets. These examples visually corroborate our quantitative improvements and highlight our framework's superior fine-grained segmentation capability.
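
The data flow described in the Figure 4 caption (project, add modality embeddings, concatenate, self-attend, upsample, classify) can be sketched in a few lines of NumPy. This is a simplified illustration under stated assumptions, not the authors' implementation: the blocks are single-head attention with a residual connection only (no layer normalization or feed-forward sublayer), upsampling is a nearest-patch index lookup, all weights are random, and the names `init_params`, `cast_style_forward`, and `point_to_patch` are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(tokens, w_q, w_k, w_v):
    """Single-head self-attention over all tokens, with a residual connection."""
    q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))
    return tokens + attn @ v

def init_params(d_geo, d_prompt, d_model, n_blocks, rng):
    """Random weights standing in for learned parameters."""
    def w(*shape):
        return rng.normal(scale=0.1, size=shape)
    return {
        "proj_geo": w(d_geo, d_model),
        "proj_prompt": w(d_prompt, d_model),
        "emb_geo": w(d_model),       # modality embedding for 3D patch tokens
        "emb_prompt": w(d_model),    # modality embedding for prompt tokens
        "blocks": [(w(d_model, d_model), w(d_model, d_model), w(d_model, d_model))
                   for _ in range(n_blocks)],
        "head_w1": w(d_model, d_model),
        "head_w2": w(d_model, 1),
    }

def cast_style_forward(patch_feats, prompt_feats, point_to_patch, params):
    """Fuse patch and prompt tokens, then predict one affordance score per point."""
    # 1. Project both modalities into a shared space and tag them by source.
    geo = patch_feats @ params["proj_geo"] + params["emb_geo"]
    prm = prompt_feats @ params["proj_prompt"] + params["emb_prompt"]
    # 2. Concatenate so self-attention mixes information in both directions.
    tokens = np.concatenate([geo, prm], axis=0)
    for w_q, w_k, w_v in params["blocks"]:
        tokens = self_attention(tokens, w_q, w_k, w_v)
    # 3. Keep the prompt-aware patch tokens and copy each to its member points.
    per_point = tokens[: geo.shape[0]][point_to_patch]
    # 4. Lightweight MLP head followed by a sigmoid.
    hidden = np.maximum(per_point @ params["head_w1"], 0.0)
    logits = hidden @ params["head_w2"]
    return 1.0 / (1.0 + np.exp(-logits[:, 0]))

rng = np.random.default_rng(0)
params = init_params(d_geo=32, d_prompt=48, d_model=16, n_blocks=2, rng=rng)
patch_feats = rng.normal(size=(8, 32))           # 8 geometric patch tokens
prompt_feats = rng.normal(size=(5, 48))          # 5 text or image prompt tokens
point_to_patch = rng.integers(0, 8, size=100)    # patch index for each of 100 points
mask = cast_style_forward(patch_feats, prompt_feats, point_to_patch, params)
print(mask.shape)  # (100,)
```

Because the patch and prompt tokens share one attention sequence, each patch token can attend to the prompt and each prompt token can attend to the geometry, which is the bidirectional conditioning the caption describes. In this simplified version, points assigned to the same patch receive the same score; an interpolation-based upsampling would vary the score within a patch.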