Table of Contents
Fetching ...

Language-Image Models with 3D Understanding

Jang Hyun Cho, Boris Ivanovic, Yulong Cao, Edward Schmerling, Yue Wang, Xinshuo Weng, Boyi Li, Yurong You, Philipp Krähenbühl, Yan Wang, Marco Pavone

TL;DR

Cube-LLM extends multi-modal language models to 3D grounding through the LV3D dataset and a unified 2D-3D pretraining framework, showing that scaling data suffices for 3D understanding without architecture changes. It leverages multi-turn QA, Visual Chain-of-Thought prompting, and specialist prompting to achieve robust 3D grounding and complex 3D reasoning, while remaining competitive on standard 2D benchmarks. The approach yields strong gains on Talk2Car and DriveLM for 3D grounding and driving-scenario reasoning, demonstrating practical potential for autonomous driving and 3D scene understanding. Overall, the work highlights that pure transformer-based scaling can unlock 3D perceptual capabilities when paired with careful data curation and flexible prompting.

Abstract

Multi-modal large language models (MLLMs) have shown incredible capabilities in a variety of 2D vision and language tasks. We extend MLLMs' perceptual capabilities to ground and reason about images in 3-dimensional space. To that end, we first develop a large-scale pre-training dataset for 2D and 3D called LV3D by combining multiple existing 2D and 3D recognition datasets under a common task formulation: as multi-turn question-answering. Next, we introduce a new MLLM named Cube-LLM and pre-train it on LV3D. We show that pure data scaling makes a strong 3D perception capability without 3D specific architectural design or training objective. Cube-LLM exhibits intriguing properties similar to LLMs: (1) Cube-LLM can apply chain-of-thought prompting to improve 3D understanding from 2D context information. (2) Cube-LLM can follow complex and diverse instructions and adapt to versatile input and output formats. (3) Cube-LLM can be visually prompted such as 2D box or a set of candidate 3D boxes from specialists. Our experiments on outdoor benchmarks demonstrate that Cube-LLM significantly outperforms existing baselines by 21.3 points of AP-BEV on the Talk2Car dataset for 3D grounded reasoning and 17.7 points on the DriveLM dataset for complex reasoning about driving scenarios, respectively. Cube-LLM also shows competitive results in general MLLM benchmarks such as refCOCO for 2D grounding with (87.0) average score, as well as visual question answering benchmarks such as VQAv2, GQA, SQA, POPE, etc. for complex reasoning. Our project is available at https://janghyuncho.github.io/Cube-LLM.

Language-Image Models with 3D Understanding

TL;DR

Cube-LLM extends multi-modal language models to 3D grounding through the LV3D dataset and a unified 2D-3D pretraining framework, showing that scaling data suffices for 3D understanding without architecture changes. It leverages multi-turn QA, Visual Chain-of-Thought prompting, and specialist prompting to achieve robust 3D grounding and complex 3D reasoning, while remaining competitive on standard 2D benchmarks. The approach yields strong gains on Talk2Car and DriveLM for 3D grounding and driving-scenario reasoning, demonstrating practical potential for autonomous driving and 3D scene understanding. Overall, the work highlights that pure transformer-based scaling can unlock 3D perceptual capabilities when paired with careful data curation and flexible prompting.

Abstract

Multi-modal large language models (MLLMs) have shown incredible capabilities in a variety of 2D vision and language tasks. We extend MLLMs' perceptual capabilities to ground and reason about images in 3-dimensional space. To that end, we first develop a large-scale pre-training dataset for 2D and 3D called LV3D by combining multiple existing 2D and 3D recognition datasets under a common task formulation: as multi-turn question-answering. Next, we introduce a new MLLM named Cube-LLM and pre-train it on LV3D. We show that pure data scaling makes a strong 3D perception capability without 3D specific architectural design or training objective. Cube-LLM exhibits intriguing properties similar to LLMs: (1) Cube-LLM can apply chain-of-thought prompting to improve 3D understanding from 2D context information. (2) Cube-LLM can follow complex and diverse instructions and adapt to versatile input and output formats. (3) Cube-LLM can be visually prompted such as 2D box or a set of candidate 3D boxes from specialists. Our experiments on outdoor benchmarks demonstrate that Cube-LLM significantly outperforms existing baselines by 21.3 points of AP-BEV on the Talk2Car dataset for 3D grounded reasoning and 17.7 points on the DriveLM dataset for complex reasoning about driving scenarios, respectively. Cube-LLM also shows competitive results in general MLLM benchmarks such as refCOCO for 2D grounding with (87.0) average score, as well as visual question answering benchmarks such as VQAv2, GQA, SQA, POPE, etc. for complex reasoning. Our project is available at https://janghyuncho.github.io/Cube-LLM.
Paper Structure (20 sections, 5 equations, 18 figures, 9 tables)

This paper contains 20 sections, 5 equations, 18 figures, 9 tables.

Figures (18)

  • Figure 1: The overview of Cube-LLM for 3D-grounded reasoning. The task requires a model to take an image, understand the input text prompt (e.g., "Black Audi on left.") and ground it in 3-dimensional space.
  • Figure 1: Cube-LLM visual chain-of-thought prompting inference. First column is input image, the second column is the 2D bounding box prediction, and the third column is the final 3D bounding box prediction prompted with the 2D prediction and text.
  • Figure 2: Qualitative results of Cube-LLM 3D grounding in 3 aspects: open-vocabulary understanding (top), complex reasoning (middle), and 3D spatial understanding (bottom). Best viewed in color, zoomed.
  • Figure 2: More visualization of 3D grounding.Cube-LLM is capable of grounding object with spatial cues and understand complex questions.
  • Figure 3: Task-scaling for versatile I/O format. Decomposing the existing label formats for 3D grounding task. A complete 3D location can be decomposed into a center point ([x, y, z]), a depth ([z]), a (projected) 2D point ([x$_\text{c}$, y$_\text{c}$]), and a (projected) 2D box ([x1, y1, x2, y2]). We define various tasks that connect among these to train versatile I/O formats. Left: available (decomposed) annotations. Right: various tasks for training.
  • ...and 13 more figures