3D-LLaVA: Towards Generalist 3D LMMs with Omni Superpoint Transformer

Jiajun Deng, Tianyu He, Li Jiang, Tianyu Wang, Feras Dayoub, Ian Reid

TL;DR

The paper addresses the challenge of building a generalist 3D vision-language model capable of dialogue, grounding, and reasoning in 3D scenes. It introduces the Omni Superpoint Transformer (OST), a multi-purpose visual connector that is combined with a Sparse 3D U-Net encoder and a prompt-encoding mechanism and trained with hybrid supervision followed by instruction tuning. Experiments on ScanNet-derived datasets show state-of-the-art performance on 3D vision-language tasks, including a high CIDEr score on ScanQA and strong 3D referring segmentation results. The work demonstrates that a unified architecture with a versatile visual connector can simplify 3D LMM pipelines and make them more practical, though data collection remains a key future challenge.

Abstract

Current 3D Large Multimodal Models (3D LMMs) have shown tremendous potential in 3D-vision-based dialogue and reasoning. However, how to further enhance 3D LMMs to achieve fine-grained scene understanding and facilitate flexible human-agent interaction remains a challenging problem. In this work, we introduce 3D-LLaVA, a simple yet highly powerful 3D LMM designed to act as an intelligent assistant in comprehending, reasoning, and interacting with the 3D world. Unlike existing top-performing methods that rely on complicated pipelines (such as offline multi-view feature extraction or additional task-specific heads), 3D-LLaVA adopts a minimalist design with an integrated architecture and takes only point clouds as input. At the core of 3D-LLaVA is a new Omni Superpoint Transformer (OST), which integrates three functionalities: (1) a visual feature selector that converts and selects visual tokens, (2) a visual prompt encoder that embeds interactive visual prompts into the visual token space, and (3) a referring mask decoder that produces 3D masks based on a text description. This versatile OST is empowered by hybrid pretraining to obtain perception priors and is leveraged as the visual connector that bridges the 3D data to the LLM. After performing unified instruction tuning, our 3D-LLaVA reports impressive results on various benchmarks.

Paper Structure

This paper contains 14 sections, 3 equations, 5 figures, 5 tables.

Figures (5)

  • Figure 1: An intuitive comparison between 3D-LLaVA and other SoTA 3D LMMs (the performance of LEO on ScanQA is omitted since its setting is different). 3D-LLaVA achieves the best results among the competitors on most of the benchmarks.
  • Figure 2: An overview of the 3D-LLaVA framework. Given an input point cloud, a language instruction, and an optional visual prompt, 3D-LLaVA generates text output from the LLM and produces 3D masks with the Omni Superpoint Transformer (OST). The 3D features output by the Sparse 3D U-Net are clustered into superpoints with Superpoint Pooling. The Visual Sampler is a parameter-free module that samples the point features corresponding to the visual prompt $X_P$. The Omni Superpoint Transformer takes both the superpoint features and the visual prompt features as input and produces the visual feature embedding $Z_V$ and the visual prompt embedding $Z_P$, which a projection layer $W_V$ maps to the token embeddings $H_V$ and $H_P$. Once the LLM outputs the special segmentation token [SEG], the hidden state linked to that token is sent to another projection layer $W_S$ and then fed as a segmentation query to the frozen OST to generate segmentation masks.
  • Figure 3: An illustration of (a) the architecture of Omni Superpoint Transformer, and (b) the paradigm of the visual sampler.
  • Figure 4: Different paradigms to produce visual prompt embedding. "OST": Omni Superpoint Transformer. "P.E. Encoder": Parameter-Free Encoder.
  • Figure 5: Visualization of 3D-LLaVA's responses on various tasks. Each example includes an instruction to perform referring segmentation. In addition, the examples show results for 3D question answering [azuma2022scanqa], 3D dense captioning [zhong2022contextual3DdenseCap], and situated question answering [ma2022sqa3d], respectively. When the referred object is not in the given 3D scene, the model responds with "Sorry, I cannot find this object".
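
The Figure 2 caption mentions two steps that are easy to picture in code: pooling per-point features into superpoints, and turning a segmentation query into a point-level mask. The toy sketch below illustrates both under assumptions that the captions do not confirm: pooling is taken to be a plain mean over the points of each superpoint, and the mask is obtained by a dot product between the query and each superpoint feature, followed by a sigmoid and a 0.5 threshold. The function names `superpoint_pool` and `decode_mask` are hypothetical and are not taken from the paper or its code.

```python
from collections import defaultdict
from math import exp


def superpoint_pool(point_feats, superpoint_ids):
    """Average per-point features within each superpoint (assumed mean pooling).

    point_feats: list of N feature vectors (lists of floats).
    superpoint_ids: list of N ints giving the superpoint of each point.
    Returns a dict {superpoint_id: mean feature vector}.
    """
    sums = {}
    counts = defaultdict(int)
    for feat, sp in zip(point_feats, superpoint_ids):
        if sp not in sums:
            sums[sp] = [0.0] * len(feat)
        sums[sp] = [s + f for s, f in zip(sums[sp], feat)]
        counts[sp] += 1
    return {sp: [s / counts[sp] for s in total] for sp, total in sums.items()}


def decode_mask(query, superpoint_feats, superpoint_ids, threshold=0.5):
    """Score superpoints against a query, then broadcast the result to points.

    Assumption: score = sigmoid(query . feature), kept if above `threshold`.
    Returns a list of N booleans, one per point.
    """
    def sigmoid(x):
        return 1.0 / (1.0 + exp(-x))

    keep = {
        sp: sigmoid(sum(q * f for q, f in zip(query, feat))) > threshold
        for sp, feat in superpoint_feats.items()
    }
    # Every point inherits the decision made for its superpoint.
    return [keep[sp] for sp in superpoint_ids]


# Toy scene: four points with 2-D features, grouped into two superpoints.
point_feats = [[1.0, 0.0], [3.0, 0.0], [0.0, -2.0], [0.0, -4.0]]
superpoint_ids = [0, 0, 1, 1]

pooled = superpoint_pool(point_feats, superpoint_ids)
mask = decode_mask([1.0, 1.0], pooled, superpoint_ids)

print(pooled)  # {0: [2.0, 0.0], 1: [0.0, -3.0]}
print(mask)    # [True, True, False, False]
```

Working at the superpoint level means the query is compared against a few pooled vectors rather than every point, and the point-level mask is recovered by a simple lookup. In the actual model the query would come from the [SEG] hidden state projected by $W_S$; here it is a hand-written vector.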