Any2Point: Empowering Any-modality Large Models for Efficient 3D Understanding

Yiwen Tang, Ray Zhang, Jiaming Liu, Zoey Guo, Dong Wang, Zhigang Wang, Bin Zhao, Shanghang Zhang, Peng Gao, Hongsheng Li, Xuelong Li

TL;DR

This paper introduces Any2Point, a parameter-efficient method that empowers any-modality large models (vision, language, audio) for 3D understanding. It keeps the pre-trained transformer frozen and inserts an any-to-3D guided adapter module into each transformer block for parameter-efficient fine-tuning.

Abstract

Large foundation models have recently emerged as a prominent focus of interest, attaining superior performance in a wide range of scenarios. Due to the scarcity of 3D data, many efforts have been made to adapt pre-trained transformers from vision to 3D domains. However, such 2D-to-3D approaches are still limited, due to the potential loss of spatial geometries and high computation cost. More importantly, their frameworks are mainly designed for 2D models, lacking a general any-to-3D paradigm. In this paper, we introduce Any2Point, a parameter-efficient method to empower any-modality large models (vision, language, audio) for 3D understanding. Given a frozen transformer from any source modality, we propose a 3D-to-any (1D or 2D) virtual projection strategy that correlates the input 3D points to the original 1D or 2D positions within the source modality. This mechanism enables us to assign each 3D token a positional encoding paired with the pre-trained model, which avoids the 3D geometry loss caused by a true projection and better motivates the transformer for 3D learning with 1D/2D positional priors. Then, within each transformer block, we insert an any-to-3D guided adapter module for parameter-efficient fine-tuning. The adapter incorporates prior spatial knowledge from the source modality to guide the local feature aggregation of 3D tokens, driving the semantic adaptation of any-modality transformers. We conduct extensive experiments to showcase the effectiveness and efficiency of our method. Code and models are released at https://github.com/Ivan-Tang-3D/Any2Point.
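
To make the 3D-to-2D case of the virtual projection concrete, here is a minimal sketch, not the authors' implementation. It assumes a ViT-style frozen positional-encoding table laid out on an H x W grid, three axis-aligned orthographic views, bilinear interpolation, and simple averaging across views. All of these choices, and the function name `virtual_projection_pe`, are hypothetical illustrations. The key idea taken from the abstract is that only the point positions are mapped into the source modality's coordinate space to look up positional encodings, while the 3D tokens themselves are never flattened into an image.

```python
import numpy as np

def virtual_projection_pe(points, pos_embed, grid_hw,
                          axes_pairs=((0, 1), (0, 2), (1, 2))):
    """Look up frozen 2D positional encodings for 3D points (illustrative sketch).

    points:     (N, 3) array of 3D coordinates.
    pos_embed:  (H*W, C) frozen positional encodings of a 2D model (assumed layout).
    grid_hw:    (H, W) shape of the positional-encoding grid.
    axes_pairs: which coordinate pairs form each virtual view (an assumption).
    Returns an (N, C) array: the view-averaged positional encoding per point.
    """
    H, W = grid_hw
    # Normalize each axis to [0, 1] so points land inside the virtual plane.
    lo, hi = points.min(axis=0), points.max(axis=0)
    unit = (points - lo) / np.maximum(hi - lo, 1e-8)

    grid = pos_embed.reshape(H, W, -1)
    out = np.zeros((points.shape[0], grid.shape[-1]))
    for a, b in axes_pairs:
        # Virtual 2D position of every point in this view (no pixels are rendered).
        u = unit[:, a] * (W - 1)
        v = unit[:, b] * (H - 1)
        u0, v0 = np.floor(u).astype(int), np.floor(v).astype(int)
        u1, v1 = np.minimum(u0 + 1, W - 1), np.minimum(v0 + 1, H - 1)
        du, dv = (u - u0)[:, None], (v - v0)[:, None]
        # Bilinear interpolation of the frozen encodings at the virtual position.
        out += (grid[v0, u0] * (1 - du) * (1 - dv)
                + grid[v0, u1] * du * (1 - dv)
                + grid[v1, u0] * (1 - du) * dv
                + grid[v1, u1] * du * dv)
    return out / len(axes_pairs)
```

In such a setup the returned encodings would be added to the 3D token features before they enter the frozen transformer. The 1D case (language or audio models) would follow the same pattern with positions along a line instead of a plane.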

Paper Structure (41 sections, 3 equations, 7 figures, 10 tables)

Figures (7)

  • Figure 1: Overview of Any2Point. We propose a general framework for any-to-3D learning that is shared across all source modalities and uses parameter-efficient fine-tuning.
  • Figure 2: Overall Pipeline of Any2Point. To efficiently fine-tune any-modality pre-trained models, the Any2Point framework contains two components: a 3D-to-any Virtual Projection, which pairs the pre-trained positional encodings with 3D tokens to avoid the loss of 3D geometric information, and an Any-to-3D Guided Adapter, which captures local structures.
  • Figure 3: 3D-to-any Virtual Projection. To prevent the loss of 3D geometric information, the module assigns 3D tokens the positional encodings paired with the pre-trained model.
  • Figure 4: Any-to-3D Guided Adapter. Inserted into every transformer block, the adapter leverages the 1D/2D-guided Local Aggregation module to capture 3D local semantics and utilizes the Adaptive Any-to-3D Ensemble to obtain high-quality features.
  • Figure 5: Visualization of Different Positional Encoding Methods. For the 1D/2D modalities, we visualize the attention scores of the [CLS] token to other point cloud tokens, utilizing sinusoidal positional encoding, learnable positional encoding, and 3D-to-any Virtual Projection. The red color indicates higher values.
  • ...and 2 more figures
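
The guided local aggregation described in the Figure 4 caption can be illustrated with a rough sketch under strong assumptions. This is not the paper's adapter: the bottleneck shape, the k-nearest-neighbor grouping in the virtual 2D space, the mean pooling, the residual scale, and the name `guided_adapter` are all hypothetical, and the Adaptive Any-to-3D Ensemble step is omitted. The sketch only shows the general pattern of a small trainable module that lets source-modality positions decide which 3D tokens are aggregated together while the transformer weights stay frozen.

```python
import numpy as np

def guided_adapter(tokens, coords_2d, w_down, w_up, k=4, scale=0.1):
    """Bottleneck adapter with 2D-guided local aggregation (illustrative sketch).

    tokens:    (N, C) 3D token features from a frozen transformer block.
    coords_2d: (N, 2) virtual 2D positions of the tokens (the spatial prior).
    w_down:    (C, r) trainable down-projection; w_up: (r, C) trainable up-projection.
    k:         neighbors per token, chosen by distance in the virtual 2D space.
    """
    # Pairwise distances in the guide (source-modality) space.
    dist = np.linalg.norm(coords_2d[:, None, :] - coords_2d[None, :, :], axis=-1)
    neighbors = np.argsort(dist, axis=1)[:, :k]  # each token's own index is included

    hidden = np.maximum(tokens @ w_down, 0.0)    # down-project, then ReLU
    pooled = hidden[neighbors].mean(axis=1)      # aggregate features over 2D neighbors
    return tokens + scale * (pooled @ w_up)      # residual update; only w_down, w_up train
```

Only `w_down` and `w_up` would be updated during fine-tuning in this sketch, which is what keeps the number of trainable parameters small.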