Language-Assisted 3D Scene Understanding
Yanmin Wu, Qiankun Gao, Renrui Zhang, Jian Zhang
TL;DR
LAST-PCL addresses data scarcity in 3D point cloud understanding by injecting language priors through LLM-generated, fine-grained text descriptions and aligning them with point features via a text-contrastive framework. A training-free, statistically driven feature selection step preserves textual priors while reducing dimensionality and redundancy, enabling efficient cross-modal supervision without paired triplets. Across 3D semantic segmentation, object detection, and scene classification, the method achieves state-of-the-art or competitive results, with notable gains on large or fine-grained category sets and robust generalization to new text queries. The work highlights the role of semantically rich text in guiding 3D perception and demonstrates flexible, scalable inference through language, including potential image-query enhancements via CLIP.
Abstract
The scale and quality of point cloud datasets constrain the advancement of point cloud learning. With the recent development of multi-modal learning, incorporating domain-agnostic prior knowledge from other modalities, such as images and text, to assist point cloud feature learning has come to be seen as a promising avenue. Existing methods have demonstrated the effectiveness of multi-modal contrastive training and feature distillation on point clouds. However, challenges remain, including the requirement for paired triplet data, redundancy and ambiguity in supervised features, and the disruption of the original priors. In this paper, we propose a language-assisted approach to point cloud feature learning (LAST-PCL), which enriches semantic concepts through LLM-based text enrichment. We achieve de-redundancy and feature dimensionality reduction without compromising textual priors through statistics-based, training-free significant feature selection. We also provide an in-depth analysis of the impact of text contrastive training on the point cloud. Extensive experiments validate that the proposed method learns semantically meaningful point cloud features and achieves state-of-the-art or comparable performance in 3D semantic segmentation, 3D object detection, and 3D scene classification tasks.
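To make the two ideas in the abstract concrete, the following is a minimal sketch, not the authors' implementation. It assumes that "significant feature selection" can be approximated by keeping the text-embedding channels with the highest variance across category descriptions (the abstract does not name the statistic), and that text contrastive training can be approximated by a cross-entropy loss over cosine similarities between point features and per-category text features. The function names, the temperature value, and the synthetic data are all hypothetical.

```python
import numpy as np


def select_significant_channels(text_emb, k):
    """Return indices of the k channels that vary most across categories.

    Variance is an assumed stand-in for the paper's statistic. The step is
    training-free: selected channels are copied, not re-projected.
    """
    variances = text_emb.var(axis=0)
    top_k = np.argsort(variances)[::-1][:k]
    return np.sort(top_k)  # keep the original channel order


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products are cosine similarities."""
    norms = np.linalg.norm(x, axis=-1, keepdims=True)
    return x / np.maximum(norms, eps)


def text_contrastive_loss(point_feat, text_feat, labels, temperature=0.07):
    """Cross-entropy over point-to-text cosine similarities (assumed form)."""
    logits = l2_normalize(point_feat) @ l2_normalize(text_feat).T / temperature
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_prob[np.arange(len(labels)), labels].mean()


# Synthetic stand-ins: 20 categories with 512-d text embeddings.
rng = np.random.default_rng(0)
text_emb = rng.normal(size=(20, 512))

idx = select_significant_channels(text_emb, k=64)
text_reduced = text_emb[:, idx]  # (20, 64), values unchanged from text_emb

# Point features that roughly match their category's text feature,
# versus unrelated random features.
labels = rng.integers(0, 20, size=100)
aligned = text_reduced[labels] + 0.1 * rng.normal(size=(100, 64))
unrelated = rng.normal(size=(100, 64))

loss_aligned = text_contrastive_loss(aligned, text_reduced, labels)
loss_unrelated = text_contrastive_loss(unrelated, text_reduced, labels)
```

Because selection only indexes into the original embedding, the retained values are identical to those produced by the text encoder, which is one plausible reading of reducing dimensionality "without compromising textual priors". A learned projection would instead mix channels and require training.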
