PointAlign: Feature-Level Alignment Regularization for 3D Vision-Language Models

Yuanhao Su, Shaofeng Zhang, Xiaosong Jia, Qi Fan

TL;DR

This work constrains the intermediate point cloud tokens within the LLM to align with visual input tokens via a consistency loss, achieving explicit feature-level supervision with minimal computational overhead and effectively preventing geometric degradation.

Abstract

The development of 3D Vision-Language Models (VLMs), crucial for applications in robotics, autonomous driving, and augmented reality, is severely constrained by the scarcity of paired 3D-text data. Existing methods rely solely on next-token prediction loss, using only language tokens for supervision. This results in inefficient utilization of limited 3D data and leads to significant degradation and loss of valuable geometric information in intermediate representations. To address these limitations, we propose PointAlign, a novel feature-level alignment regularization method. PointAlign explicitly supervises intermediate point cloud tokens to preserve fine-grained 3D geometric-semantic information throughout the language modeling process. Specifically, we constrain the intermediate point cloud tokens within the LLM to align with visual input tokens via a consistency loss. By training only a lightweight alignment projector and LoRA adapters, PointAlign achieves explicit feature-level supervision with minimal computational overhead, effectively preventing geometric degradation. Extensive experiments on the ModelNet40 and Objaverse datasets demonstrate that our method achieves a 2.08 pp improvement on average for classification tasks, with a substantial 7.50 pp gain on the challenging open-vocabulary Objaverse classification task and a 4.88 pp improvement on 3D object captioning as evaluated by Qwen2-72B-Instruct, validating the effectiveness of PointAlign. Code is publicly available at https://github.com/yharoldsu0627/PointAlign.
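
As a rough illustration of the consistency loss described above, the following sketch computes a cosine-based alignment loss between projected intermediate tokens and target visual tokens. It is not the authors' implementation. The `1 - cosine` form, the mean over tokens, the single linear projector, and the function names (`cosine_similarity`, `project`, `alignment_loss`) are all assumptions made for illustration. The source states only that a lightweight alignment projector and a cosine similarity loss are used. Pure Python is used for readability; a real training setup would rely on a tensor library with automatic differentiation.

```python
import math


def cosine_similarity(u, v, eps=1e-8):
    """Cosine similarity between two equal-length vectors (lists of floats)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    # eps guards against division by zero for all-zero vectors.
    return dot / max(norm_u * norm_v, eps)


def project(token, weight, bias):
    """Hypothetical linear alignment projector: weight is (out_dim x in_dim)."""
    return [
        sum(w * x for w, x in zip(row, token)) + b
        for row, b in zip(weight, bias)
    ]


def alignment_loss(hidden_tokens, target_tokens, weight, bias):
    """Mean of (1 - cosine similarity) over paired tokens.

    hidden_tokens: intermediate point cloud tokens taken from an LLM layer.
    target_tokens: visual input tokens used as the alignment target.
    Both the pairing by index and the 1 - cos form are assumptions.
    """
    if len(hidden_tokens) != len(target_tokens):
        raise ValueError("hidden and target token counts must match")
    if not hidden_tokens:
        raise ValueError("at least one token pair is required")
    total = 0.0
    for hidden, target in zip(hidden_tokens, target_tokens):
        projected = project(hidden, weight, bias)
        total += 1.0 - cosine_similarity(projected, target)
    return total / len(hidden_tokens)
```

Under these assumptions, the loss is approximately zero when every projected token points in the same direction as its target, and it approaches 2 when they point in opposite directions. In training, such a term would typically be added to the next-token prediction loss with a weighting coefficient, which is likewise an assumption here.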

Paper Structure (17 sections, 3 equations, 4 figures, 10 tables, 1 algorithm)

This paper contains 17 sections, 3 equations, 4 figures, 10 tables, and 1 algorithm.

Figures (4)

  • Figure 1: Overview of our method. Stage 1 adopts the three training recipes of MiniGPT-3D for pre-training. Stage 2 freezes the point cloud encoder, MLP, Q-Former, and modality projector, and trains only the LoRA layers of the LLM and the alignment projector. The alignment projector aligns the latent representations of point cloud tokens in the LLM with the Q-Former output through a cosine similarity loss. Flame icons indicate trainable modules, and snowflakes indicate frozen modules.
  • Figure 2: Examples of 3D Object Understanding with Our Model. The figure demonstrates our model's capabilities in the context of 3D object understanding, showcasing its performance on tasks such as 3D object recognition, description generation, and 3D VQA.
  • Figure 3: KNN classification accuracy of point cloud tokens extracted from different LLM layers on ModelNet40. We compare the baseline model and our aligned model using K=1 and K=10.
  • Figure 4: Impact of training data fraction on 3D object captioning performance. We evaluate baseline and aligned models using 10%, 30%, 50%, 70%, and 100% of the training data on the Objaverse dataset.
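
The layer-wise probe summarized in Figure 3 can be approximated with a simple K-nearest-neighbor classifier over per-object features. The sketch below is an illustration under stated assumptions, not the paper's evaluation code. It assumes each object has already been reduced to a single feature vector (for example, by averaging its point cloud tokens at a given LLM layer), and it uses Euclidean distance with majority voting. The names `knn_predict` and `knn_accuracy` are hypothetical.

```python
import math
from collections import Counter


def knn_predict(train_feats, train_labels, query, k):
    """Predict a label for `query` by majority vote among its k nearest
    training features (Euclidean distance, an assumed choice)."""
    if k < 1:
        raise ValueError("k must be at least 1")
    # Sort training samples by distance to the query; keep their indices.
    ranked = sorted(
        (math.dist(feat, query), idx) for idx, feat in enumerate(train_feats)
    )
    votes = Counter(train_labels[idx] for _, idx in ranked[:k])
    return votes.most_common(1)[0][0]


def knn_accuracy(train_feats, train_labels, test_feats, test_labels, k):
    """Fraction of test features whose KNN prediction matches the label."""
    if not test_feats:
        raise ValueError("at least one test feature is required")
    correct = sum(
        knn_predict(train_feats, train_labels, feat, k) == label
        for feat, label in zip(test_feats, test_labels)
    )
    return correct / len(test_feats)
```

Running `knn_accuracy` once per LLM layer, with K=1 and K=10 as in the figure, would give one accuracy curve per model. Higher accuracy at a layer suggests that the point cloud tokens there retain more class-discriminative information.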