Point Linguist Model: Segment Any Object via Bridged Large 3D-Language Model

Zhuoxu Huang, Mingqi Gao, Jungong Han

TL;DR

This paper presents the Point Linguist Model (PLM), a general framework that bridges the representation gap between LLMs and dense 3D point clouds without requiring large-scale pre-alignment between 3D-text or 3D-images. It introduces Object-centric Discriminative Representation (OcDR), which learns object-centric tokens that capture target semantics and scene relations under a hard negative-aware training objective.

Abstract

3D object segmentation with Large Language Models (LLMs) has become a prevailing paradigm due to its broad semantics, task flexibility, and strong generalization. However, this paradigm is hindered by representation misalignment: LLMs process high-level semantic tokens, whereas 3D point clouds convey only dense geometric structures. In prior methods, misalignment limits both input and output. At the input stage, dense point patches require heavy pre-alignment, weakening object-level semantics and confusing similar distractors. At the output stage, predictions depend only on dense features without explicit geometric cues, leading to a loss of fine-grained accuracy. To address these limitations, we present the Point Linguist Model (PLM), a general framework that bridges the representation gap between LLMs and dense 3D point clouds without requiring large-scale pre-alignment between 3D-text or 3D-images. Specifically, we introduce Object-centric Discriminative Representation (OcDR), which learns object-centric tokens that capture target semantics and scene relations under a hard negative-aware training objective. This mitigates the misalignment between LLM tokens and 3D points, enhances resilience to distractors, and facilitates semantic-level reasoning within LLMs. For accurate segmentation, we introduce the Geometric Reactivation Decoder (GRD), which predicts masks by combining OcDR tokens carrying LLM-inferred geometry with corresponding dense features, preserving comprehensive dense features throughout the pipeline. Extensive experiments show that PLM achieves significant improvements of +7.3 mIoU on ScanNetv2 and +6.0 mIoU on Multi3DRefer for 3D referring segmentation, with consistent gains across 7 benchmarks spanning 4 different tasks, demonstrating the effectiveness of comprehensive object-centric reasoning for robust 3D understanding.
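
To make the idea of combining an object token with dense features, and the reported mIoU metric, more concrete, here is a minimal sketch. It is not the paper's GRD: the abstract does not give its formulation, so the sketch assumes a generic query-style decoder that scores each point by the dot product between one object token and that point's dense feature, then thresholds the score. The names `predict_mask`, `iou`, and `mean_iou`, the toy vectors, and the 0.5 threshold are all hypothetical. mIoU is computed as the average of per-sample IoU, which is one common definition.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def predict_mask(object_token, point_features, threshold=0.5):
    """Assumed decoder: dot product of token and per-point feature, then threshold."""
    mask = []
    for feat in point_features:
        logit = sum(t * f for t, f in zip(object_token, feat))
        mask.append(sigmoid(logit) > threshold)
    return mask


def iou(pred, target):
    """Intersection over union between two boolean per-point masks."""
    inter = sum(p and t for p, t in zip(pred, target))
    union = sum(p or t for p, t in zip(pred, target))
    # Convention assumed here: two empty masks count as a perfect match.
    return inter / union if union else 1.0


def mean_iou(pairs):
    """Average IoU over (prediction, ground truth) pairs."""
    return sum(iou(p, t) for p, t in pairs) / len(pairs)


# Toy scene: four points with 2-D features (illustrative values only).
object_token = [1.0, -1.0]
point_features = [[2.0, 0.0], [0.0, 2.0], [1.0, 0.5], [-1.0, 1.0]]
ground_truth = [True, False, False, False]

pred = predict_mask(object_token, point_features)
score = iou(pred, ground_truth)
print(pred, score)
```

In this toy scene the third point acts as a distractor: its feature is similar enough to the token to pass the threshold, which lowers the IoU. This is the kind of confusion the hard negative-aware objective is said to address.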

Paper Structure

This paper contains 46 sections, 18 equations, 6 figures, 14 tables.

Figures (6)

  • Figure 1: Comparison between (a) previous approaches and (b) our PLM. Given an input point cloud (colors denote different objects), previous approaches partition points into patches, ignoring object boundaries and target-level semantics. In contrast, PLM constructs an Object-centric Discriminative Representation (OcDR) from dense point features under distractor-aware supervision, capturing target-level semantics and explicitly assigning each token to a specific object while preserving inter-object differentiation. At output, previous approaches rely solely on dense scene features for the final prediction. To leverage geometric cues within the LLM reasoning pipeline, PLM injects dense features into the LLM and reactivates preserved details via the Geometric Reactivation Decoder (GRD).
  • Figure 2: Example of distractors. In a complex scene, multiple objects are semantically related to the instruction (e.g., "Chair"), but only one is the ground truth.
  • Figure 3: Overall architecture of the proposed Point Linguist Model. We propose OcDR to bridge the input pipeline from dense point cloud input to multi-modal LLM interaction, and design GRD to bridge the output pipeline from LLM outputs to dense segmentation. The proposed model can handle different tasks by adapting to different language instructions.
  • Figure 4: Visualization results of PLM in different segmentation tasks. (i) Our model can easily reason and comprehend implicit user instructions. (ii) and (iii) Our model enables flexible segmentation of multiple objects with clear instance separation. Different highlight colors represent different instances.
  • Figure 5: Visualization results on reasoning expression segmentation. The data come from the partially open-source Instruc3D dataset [he2025segpoint]. The segmented results are highlighted. Zoom in for better details.
  • ...and 1 more figure