3D-LLaVA: Towards Generalist 3D LMMs with Omni Superpoint Transformer
Jiajun Deng, Tianyu He, Li Jiang, Tianyu Wang, Feras Dayoub, Ian Reid
TL;DR
The paper addresses the challenge of building a generalist 3D vision-language model capable of dialogue, grounding, and reasoning in 3D scenes. It introduces the Omni Superpoint Transformer (OST), a multi-purpose visual connector integrated with a Sparse 3D U-Net encoder and a prompt-encoding mechanism, trained with hybrid supervision and instruction tuning. Experiments on ScanNet-derived datasets show state-of-the-art performance on 3D vision-language tasks, including a high CIDEr score on ScanQA and strong 3D referring segmentation results. The work demonstrates that a unified architecture with a versatile visual connector can simplify 3D LMM pipelines and make them more practical, though data collection remains a key challenge for future work.
Abstract
Current 3D Large Multimodal Models (3D LMMs) have shown tremendous potential in 3D-vision-based dialogue and reasoning. However, how to further enhance 3D LMMs to achieve fine-grained scene understanding and facilitate flexible human-agent interaction remains a challenging problem. In this work, we introduce 3D-LLaVA, a simple yet highly powerful 3D LMM designed to act as an intelligent assistant in comprehending, reasoning about, and interacting with the 3D world. Unlike existing top-performing methods that rely on complicated pipelines (such as offline multi-view feature extraction or additional task-specific heads), 3D-LLaVA adopts a minimalist design with an integrated architecture and takes only point clouds as input. At the core of 3D-LLaVA is a new Omni Superpoint Transformer (OST), which integrates three functionalities: (1) a visual feature selector that converts and selects visual tokens, (2) a visual prompt encoder that embeds interactive visual prompts into the visual token space, and (3) a referring mask decoder that produces 3D masks based on a text description. This versatile OST is empowered by hybrid pretraining to obtain perception priors and is used as the visual connector that bridges the 3D data and the LLM. After unified instruction tuning, 3D-LLaVA reports impressive results on various benchmarks.
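The three roles attributed to the OST can be illustrated with a toy sketch in plain Python. This is not the authors' implementation: the function names (`select_tokens`, `encode_prompt`, `decode_mask`), the top-k selection rule, the mean-pooled prompt embedding, and the sigmoid-thresholded dot product are all assumptions chosen only to make the three interfaces concrete on hand-written superpoint features.

```python
import math

# Hypothetical toy stand-ins for the three OST roles; none of these names or
# rules come from the paper.

def select_tokens(features, scores, k):
    """Feature selector: keep the k highest-scoring tokens, in original order."""
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    return [features[i] for i in sorted(top)]

def encode_prompt(features, prompt_ids):
    """Prompt encoder: embed a visual prompt as the mean feature of the
    superpoints it covers (assumed pooling rule)."""
    dim = len(features[0])
    return [sum(features[i][d] for i in prompt_ids) / len(prompt_ids)
            for d in range(dim)]

def decode_mask(features, query, threshold=0.5):
    """Mask decoder: mark superpoints whose sigmoid(dot(feature, query))
    exceeds the threshold (assumed similarity rule)."""
    mask = []
    for f in features:
        logit = sum(a * b for a, b in zip(f, query))
        mask.append(1.0 / (1.0 + math.exp(-logit)) > threshold)
    return mask

# Four superpoints with 2-D features and made-up relevance scores.
feats = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [2.0, 2.0]]
scores = [0.9, 0.1, 0.3, 0.8]

selected = select_tokens(feats, scores, k=2)   # tokens passed on to the LLM
prompt = encode_prompt(feats, [0, 3])          # prompt covering superpoints 0 and 3
mask = decode_mask(feats, [1.0, 0.0])          # query standing in for a text-derived embedding
```

In this toy setting one set of superpoint features serves all three functions, which is the property the abstract highlights: a single connector, rather than separate task-specific heads, supplies tokens, prompt embeddings, and masks.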
