Scenes as Tokens: Multi-Scale Normal Distributions Transform Tokenizer for General 3D Vision-Language Understanding

Yutao Tang, Cheng Zhao, Gaurav Mittal, Rohith Kukkala, Rama Chellappa, Cheng Peng, Mei Chen

TL;DR

The paper tackles the challenge of converting dense 3D point clouds into compact, LLM-friendly tokens for flexible language-driven reasoning. It introduces NDTokenizer3D, which pairs multi-scale NDT-based tokenization with a Multi-Scale NDT Decoder (MSDec) that fuses scale-separated information into holistic scene tokens. MSDec also serves as a unified interface for human prompting and segmentation within the same framework. Training proceeds in two stages. Stage 1 pre-trains the 3D encoder and MSDec on 3D Instance Segmentation with semantic and mask heads plus CLIP-based 2D–3D supervision. Stage 2 freezes the encoder and MSDec and instruction-tunes the projection heads and the LLM on Referring Segmentation, VQA, and Dense Captioning, using losses $\mathcal{L}_t$, $\mathcal{L}_m$, and $\mathcal{L}_s$. Experiments on ScanNet-based benchmarks, including Multi3DRefer, ScanQA, SQA3D, and Scan2Cap, show that NDTokenizer3D achieves strong cross-task performance with reduced hallucinations, demonstrating the effectiveness of information-preserving 3D scene tokens for grounded, interactive 3D vision–language understanding.

Abstract

Recent advances in 3D vision-language models (VLMs) highlight a strong potential for 3D scene understanding and reasoning. However, effectively tokenizing 3D scenes into holistic scene tokens, and leveraging these tokens across diverse 3D understanding tasks, remain highly challenging. We present NDTokenizer3D, a generalist 3D VLM that performs a wide range of 3D scene understanding tasks while naturally supporting human interactions, thereby bridging language-level reasoning with 3D spatial understanding. The core of our approach is a novel three-stage scene tokenization pipeline built upon a Multi-Scale Normal Distributions Transform (NDT) representation, paired with a Multi-Scale NDT Decoder (MSDec). Specifically, NDTokenizer3D first constructs a multi-scale NDT representation from raw high-resolution point clouds, preserving both global context and fine-grained geometric details. Next, the MSDec progressively fuses cross-scale NDT features, producing holistic scene tokens consumable by LLM endpoints. Beyond tokenization, MSDec is repurposed as a general interface for human-interactive prompting (points, boxes, masks) and segmentation-mask decoding, unifying diverse 3D scene understanding tasks within a single architecture. With this compact and unified design, NDTokenizer3D offers a fine-grained, general-purpose 3D VLM, achieving remarkable improvements in 3D Referring Segmentation, 3D Visual Question Answering, and 3D Dense Captioning.
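
To make the multi-scale NDT idea concrete, the sketch below bins a point cloud into cubic cells at several resolutions and summarises each cell by a Gaussian (mean and covariance), which is the standard NDT cell description. It is a minimal illustration, not the paper's implementation: the function names `ndt_cells` and `multi_scale_ndt`, the cell sizes, and the use of a biased covariance are all assumptions made here, and the paper's cells may carry additional learned features.

```python
from collections import defaultdict
from math import floor


def ndt_cells(points, cell_size):
    """Group 3D points into cubic cells and fit one Gaussian per cell.

    Returns {cell_index: (mean, covariance, point_count)}.
    Hypothetical helper for illustration only.
    """
    buckets = defaultdict(list)
    for p in points:
        key = tuple(floor(c / cell_size) for c in p)
        buckets[key].append(p)

    cells = {}
    for key, pts in buckets.items():
        n = len(pts)
        mean = [sum(p[i] for p in pts) / n for i in range(3)]
        # Biased (divide-by-n) covariance, so single-point cells are
        # well defined (all zeros). This choice is an assumption.
        cov = [
            [sum((p[i] - mean[i]) * (p[j] - mean[j]) for p in pts) / n
             for j in range(3)]
            for i in range(3)
        ]
        cells[key] = (mean, cov, n)
    return cells


def multi_scale_ndt(points, cell_sizes=(0.8, 0.4, 0.2)):
    """Build one NDT grid per cell size (coarse to fine); sizes are assumed."""
    return {size: ndt_cells(points, size) for size in cell_sizes}


# Tiny example: three points, two resolutions.
points = [(0.1, 0.1, 0.1), (0.3, 0.1, 0.1), (0.9, 0.9, 0.9)]
scales = multi_scale_ndt(points, cell_sizes=(1.0, 0.5))
for size, cells in scales.items():
    print(size, len(cells), "occupied cells")
```

Coarse cells cover more points each and so capture global layout, while fine cells keep local surface shape in their covariances; this is the trade-off the abstract refers to as preserving both global context and fine-grained geometric detail.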

Paper Structure

This paper contains 15 sections, 9 equations, 3 figures, 5 tables.

Figures (3)

  • Figure 1: We introduce NDTokenizer3D, a generalist 3D VLM that bridges language-level reasoning with spatial understanding. By tokenizing complex 3D scenes into information-rich representations, NDTokenizer3D enables diverse tasks such as 3D Visual Question Answering, Dense Captioning, and Referring Segmentation within a unified and interactive framework.
  • Figure 2: NDTokenizer3D is a general-purpose 3D VLM that supports a wide range of 3D understanding tasks. The model introduces a novel three-stage scene tokenization pipeline that constructs multi-scale NDT representations and aggregates them via MSDec to generate holistic scene tokens. The lower-left shows MSDec with $R$ transformer decoder layers that integrate multi-scale NDT features, using them as Key and Value. Beyond feature integration, MSDec also acts as a unified interface for user prompting and mask decoding.
  • Figure 3: Qualitative comparison between NDTokenizer3D and 3D-LLaVA across four tasks, showing NDTokenizer3D's improved grounding, spatial reasoning, and object understanding.
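
The Figure 2 caption describes MSDec as decoder layers that take multi-scale NDT features as Key and Value. The sketch below shows only that access pattern: a set of query vectors is updated by cross-attention against each scale's features in turn. It is a simplified illustration under stated assumptions, not the paper's decoder: the function names are hypothetical, the scale ordering and the source of the queries are assumed, and the learned projections, self-attention, feed-forward blocks, and normalisation of a real transformer decoder layer are omitted.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries, feats):
    """Single-head cross-attention with a residual update.

    `feats` serves as both Key and Value, with no learned projections
    (an assumption made to keep the sketch short).
    """
    d = len(queries[0])
    updated = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in feats]
        weights = softmax(scores)
        context = [sum(w * v[i] for w, v in zip(weights, feats))
                   for i in range(d)]
        updated.append([qi + ci for qi, ci in zip(q, context)])
    return updated


def fuse_scales(queries, feats_per_scale):
    """Let the queries read from each scale in turn (ordering is assumed)."""
    for feats in feats_per_scale:
        queries = cross_attend(queries, feats)
    return queries


# One 2-D query, two scales with one and two feature vectors respectively.
queries = [[1.0, 0.0]]
feats_per_scale = [
    [[0.0, 1.0]],
    [[1.0, 1.0], [1.0, 1.0]],
]
tokens = fuse_scales(queries, feats_per_scale)
print(len(tokens), "fused token(s)")
```

Because cross-attention returns one output per query, the number of fused tokens is set by the number of queries rather than by how many cells each scale contains, which is one way a dense scene can be reduced to a compact token set.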