More Text, Less Point: Towards 3D Data-Efficient Point-Language Understanding

Yuan Tang, Xu Han, Xianzhi Li, Qiao Yu, Jinfeng Xu, Yixue Hao, Long Hu, Min Chen

TL;DR

GreenPLM leverages additional text data to compensate for scarce 3D data, requiring only 12% of the 3D training data used by existing state-of-the-art models to achieve superior 3D understanding.

Abstract

Enabling Large Language Models (LLMs) to comprehend the 3D physical world remains a significant challenge. Due to the lack of large-scale 3D-text pair datasets, the success of LLMs has yet to be replicated in 3D understanding. In this paper, we rethink this issue and propose a new task: 3D Data-Efficient Point-Language Understanding. The goal is to enable LLMs to achieve robust 3D object understanding with minimal 3D point cloud and text data pairs. To address this task, we introduce GreenPLM, which leverages more text data to compensate for the lack of 3D data. First, inspired by the use of CLIP to align images and text, we utilize a pre-trained point cloud-text encoder to map the 3D point cloud space to the text space. This mapping allows us to seamlessly connect the text space with LLMs. Once the point-text-LLM connection is established, we further enhance text-LLM alignment by expanding the intermediate text space, thereby reducing the reliance on 3D point cloud data. Specifically, we generate 6M free-text descriptions of 3D objects and design a three-stage training strategy to help LLMs better explore the intrinsic connections between different modalities. To achieve efficient modality alignment, we design a zero-parameter cross-attention module for token pooling. Extensive experimental results show that GreenPLM requires only 12% of the 3D training data used by existing state-of-the-art models to achieve superior 3D understanding. Remarkably, GreenPLM also achieves competitive performance using text-only data. The code and weights are available at: https://github.com/TangYuan96/GreenPLM.

Paper Structure (34 sections, 5 equations, 20 figures, 9 tables)

Figures (20)

  • Figure 1: We propose GreenPLM, which expands the text space to reduce the need for 3D data. GreenPLM achieves strong 3D understanding using just 12% of the 3D data or even with text-only data.
  • Figure 2: Existing methods like PointLLM use massive 3D-text data ($\sim$730K) to enhance the point-text mapping and thereby achieve point-language understanding, whereas we achieve this with only a small amount of 3D data ($\sim$90K) plus free-text descriptions for better point-LLM alignment.
  • Figure 3: T3D dataset distribution.
  • Figure 4: Illustration of the 3-Stage Training Strategy. We expand the text space by feeding more text data in Stages I & II, thus reducing the demand for 3D data in Stage III. We input the text/point cloud to the encoders, then align their outputs with the LLM via an MLP projector. Additionally, we design a 0M-Pooling module to efficiently compress the token sequence output by the point encoder.
  • Figure 5: Illustration of 0M-Pooling, which compresses $N$ tokens to $M$ tokens ($M \ll N$).
  • ...and 15 more figures
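
The Figure 5 caption and the abstract describe a zero-parameter cross-attention module that pools $N$ tokens down to $M$ tokens. The sketch below illustrates the general idea of attention-based pooling without learnable weights. It is not the authors' 0M-Pooling implementation: the function name `zero_param_pool`, the choice of queries (means of contiguous chunks of the input), and the dot-product scaling are all assumptions made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def zero_param_pool(tokens, m):
    """Compress N token vectors into m vectors using no learnable weights.

    Assumption (hypothetical, not taken from the paper): the m queries are
    the means of m contiguous chunks of the input, and the input tokens
    themselves serve as both keys and values.
    """
    n = len(tokens)
    if not 0 < m <= n:
        raise ValueError("m must satisfy 0 < m <= number of tokens")
    d = len(tokens[0])

    # Chunk boundaries; integer division guarantees every chunk is non-empty.
    bounds = [(i * n) // m for i in range(m + 1)]
    queries = []
    for i in range(m):
        chunk = tokens[bounds[i]:bounds[i + 1]]
        queries.append([sum(col) / len(chunk) for col in zip(*chunk)])

    scale = 1.0 / math.sqrt(d)
    pooled = []
    for q in queries:
        # Scaled dot-product scores of this query against every token.
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in tokens]
        weights = softmax(scores)
        # Each output is a convex combination of the input tokens.
        pooled.append(
            [sum(w * v[c] for w, v in zip(weights, tokens)) for c in range(d)]
        )
    return pooled


tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]
pooled = zero_param_pool(tokens, 2)  # 4 tokens compressed to 2
```

Because the attention weights come only from the tokens themselves, the module adds no trainable parameters; the number of output tokens is set entirely by `m`.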