GEAL: Generalizable 3D Affordance Learning with Cross-Modal Consistency

Dongyue Lu, Lingdong Kong, Tianxin Huang, Gim Hee Lee

TL;DR

GEAL addresses the limited generalization and robustness of 3D affordance learning by using Gaussian splatting to bridge 3D point clouds and rich 2D semantics, building a 2D rendering branch from 3D data. A granularity-adaptive fusion module and a 2D-3D consistency alignment module enable cross-modal knowledge transfer, allowing the 3D branch to inherit robust semantics from large-scale 2D foundation models. The paper introduces two corruption-based benchmarks, PIAD-C and LASO-C, to evaluate robustness holistically under real-world corruptions, and shows that GEAL outperforms state-of-the-art methods on seen and unseen object categories as well as on corrupted data. These results suggest a practical pathway to more reliable, cross-modal 3D affordance reasoning for robotics and human-machine interaction.

Abstract

Identifying affordance regions on 3D objects from semantic cues is essential for robotics and human-machine interaction. However, existing 3D affordance learning methods struggle with generalization and robustness due to limited annotated data and a reliance on 3D backbones focused on geometric encoding, which often lack resilience to real-world noise and data corruption. We propose GEAL, a novel framework designed to enhance the generalization and robustness of 3D affordance learning by leveraging large-scale pre-trained 2D models. We employ a dual-branch architecture with Gaussian splatting to establish consistent mappings between 3D point clouds and 2D representations, enabling realistic 2D renderings from sparse point clouds. A granularity-adaptive fusion module and a 2D-3D consistency alignment module further strengthen cross-modal alignment and knowledge transfer, allowing the 3D branch to benefit from the rich semantics and generalization capacity of 2D models. To holistically assess the robustness, we introduce two new corruption-based benchmarks: PIAD-C and LASO-C. Extensive experiments on public datasets and our benchmarks show that GEAL consistently outperforms existing methods across seen and novel object categories, as well as corrupted data, demonstrating robust and adaptable affordance prediction under diverse conditions. Code and corruption datasets have been made publicly available.
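The idea of aligning the two branches in a shared embedding space can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the loss, so a cosine-based consistency term is assumed here, and the names `consistency_loss`, `w_3d`, and `w_2d` are hypothetical. The sketch also assumes that 2D features have already been mapped back to the points, so that row `i` of both feature arrays describes the same point.

```python
import numpy as np


def l2_normalize(x, eps=1e-8):
    """Scale each row to (approximately) unit length."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def consistency_loss(feat_3d, feat_2d, w_3d, w_2d):
    """Mean (1 - cosine similarity) between matched 3D and 2D features.

    Assumption: a cosine consistency term; the paper may use a different loss.
    feat_3d: (N, D3) per-point features from the 3D branch.
    feat_2d: (N, D2) per-point features lifted from the 2D branch.
    w_3d, w_2d: hypothetical linear projections into a shared D-dim space.
    """
    z_3d = l2_normalize(feat_3d @ w_3d)
    z_2d = l2_normalize(feat_2d @ w_2d)
    return float(np.mean(1.0 - np.sum(z_3d * z_2d, axis=-1)))


# Toy data standing in for real branch outputs.
rng = np.random.default_rng(0)
n_points, d3, d2, d = 1024, 64, 128, 32
feat_3d = rng.normal(size=(n_points, d3))
feat_2d = rng.normal(size=(n_points, d2))
w_3d = rng.normal(size=(d3, d))
w_2d = rng.normal(size=(d2, d))

loss = consistency_loss(feat_3d, feat_2d, w_3d, w_2d)
same = consistency_loss(feat_3d, feat_3d, w_3d, w_3d)
```

In this sketch the loss lies in [0, 2] and approaches 0 when both branches produce the same embedding for every point, which is the behavior a consistency objective is meant to encourage.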

Paper Structure

This paper contains 31 sections, 11 equations, 8 figures, 12 tables.

Figures (8)

  • Figure 1: 3D affordance prediction under various types of data noise. Given a textual prompt, previous methods such as LASO [li2024laso] (right side of each example) lose robustness across different corruption types. In contrast, our proposed method, GEAL (left side of each example), maintains high accuracy and generalization across these challenging scenarios by effectively transferring knowledge from a large-scale pre-trained 2D foundation model, enhancing robustness and adaptability under diverse conditions.
  • Figure 2: (Left) Framework overview. The proposed GEAL consists of two branches: 3D and 2D. The 2D branch is established through 3D Gaussian Splatting to leverage the generalization capabilities of large pre-trained 2D models (cf. the preliminaries section). We then perform cross-modality alignment, including Granularity-Adaptive Visual-Textual Fusion and 2D-3D Consistency Alignment, to unify features from different modalities into a shared embedding space (cf. the alignment section). Finally, we decode generalizable affordance from this embedding space (cf. the decoding section). (Right) Architecture of the 2D-3D Consistency Alignment Module. This module maps features from 2D and 3D modalities into a shared embedding space and enforces consistency alignment to enable effective knowledge transfer across branches.
  • Figure 3: Illustration of the Granularity-Adaptive Fusion Module, which consists of (a) a Flexible Granularity Feature Aggregation mechanism and (b) a Text-Conditioned Visual Alignment mechanism. The 2D branch is shown as an example.
  • Figure 4: Qualitative comparisons between GEAL and LASO [li2024laso] on the PIAD [yang2023grounding] dataset. The top two rows show results on the seen partition, and the bottom two rows show results on the unseen partition. Our method demonstrates strong generalization on both partitions. See the supplementary material for more examples.
  • Figure 5: Visualization examples of the PIAD-C dataset. We show 7 corruption types across 5 severity levels.
  • ...and 3 more figures
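
The corruption-by-severity layout described for PIAD-C in Figure 5 can be illustrated with a small sketch. It is only an illustration under stated assumptions: the two corruptions below (`jitter` and `drop_global`) and their per-severity scales are hypothetical choices, not the benchmark's actual definitions, which are not given in this summary.

```python
import numpy as np

# Hypothetical parameters for severity levels 1..5 (not taken from PIAD-C).
JITTER_SIGMA = (0.01, 0.02, 0.03, 0.04, 0.05)
DROP_RATIO = (0.1, 0.2, 0.3, 0.4, 0.5)


def jitter(points, severity, rng):
    """Add zero-mean Gaussian noise to every coordinate."""
    sigma = JITTER_SIGMA[severity - 1]
    return points + rng.normal(scale=sigma, size=points.shape)


def drop_global(points, severity, rng):
    """Randomly discard a severity-dependent fraction of the points."""
    n = points.shape[0]
    keep = int(round(n * (1.0 - DROP_RATIO[severity - 1])))
    idx = rng.choice(n, size=keep, replace=False)
    return points[idx]


rng = np.random.default_rng(0)
cloud = rng.uniform(-1.0, 1.0, size=(2048, 3))  # toy point cloud

# Build one corrupted copy per (corruption, severity) pair.
corrupted = {
    (fn.__name__, s): fn(cloud, s, rng)
    for fn in (jitter, drop_global)
    for s in range(1, 6)
}
```

A robustness benchmark of this shape evaluates a model on every entry of such a grid, so that degradation can be reported per corruption type and per severity level.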