Do 3D Large Language Models Really Understand 3D Spatial Relationships?

Xianzheng Ma, Tao Sun, Shuai Chen, Yash Bhalgat, Jindong Gu, Angel X Chang, Iro Armeni, Iro Laina, Songyou Peng, Victor Adrian Prisacariu

Abstract

Recent 3D Large-Language Models (3D-LLMs) claim to understand 3D worlds, especially spatial relationships among objects. Yet, we find that simply fine-tuning a language model on text-only question-answer pairs can perform comparably to or even surpass these methods on the SQA3D benchmark without using any 3D input. This indicates that the SQA3D benchmark may not be able to detect whether a model exploits textual shortcuts rather than engaging in 3D-aware reasoning. To address this issue, we introduce Real-3DQA, a more rigorous evaluation benchmark that filters out easy-to-guess questions and introduces a structured taxonomy to assess various aspects of 3D reasoning. Experiments on Real-3DQA confirm that existing 3D-LLMs struggle with spatial relationships once simple cues are removed. We further propose a 3D-reweighted training objective that guides the model to rely more on 3D visual clues, substantially enhancing 3D-LLMs' performance in spatial reasoning tasks. Our findings underscore the need for robust benchmarks and tailored training strategies to advance genuine 3D vision-language understanding. Project page: https://real-3dqa.github.io/.

Paper Structure (36 sections, 6 equations, 11 figures, 7 tables)


Figures (11)

  • Figure 1: Our Real-3DQA Benchmark and Solution. We demonstrate the answer differences between 3D-LLM (LEO) and its blind-finetuned version on various questions, with correct answers highlighted in green frames and incorrect ones in red frames. We discover that some questions can be answered correctly regardless of whether 3D information is used, which we consider to be 3D-independent questions. By filtering out these 3D-independent questions and introducing a new viewpoint rotation score, the original models' performance drops significantly on Real-3DQA. Finally, using our proposed 3D-aware Reweighted Finetuning strategy, performance improves again.
  • Figure 2: Our finding: A language model fine-tuned only on text QA pairs without any 3D inputs (Blind Finetuned) can match or even surpass state-of-the-art 3D-LLMs (Original) on multiple 3D-QA benchmarks. This exposes a critical weakness in current benchmark design and calls into question whether these benchmarks assess genuine 3D reasoning or merely reward linguistic shortcuts.
  • Figure 3: Overview of Real-3DQA Construction Process. Real-3DQA provides a fair and rigorous evaluation framework for 3D spatial reasoning in 3D-LLMs. The construction process begins with Filtering 3D-independent Questions, which removes questions that can be correctly answered by both the 3D-LLM $M_x$ and its text-only counterpart $M_x^{blind}$, as well as those answerable by the GPT model without 3D input. The remaining high-quality questions $Q_{Final}$ are then augmented using GPT, generating spatially consistent variations through viewpoint rotations while preserving the underlying 3D relationships. Finally, expert reviews eliminate redundant and invalid data to ensure dataset quality.
  • Figure 4: Viewpoint Rotation Augmentation. Real-3DQA generates viewpoint-augmented SQA instances to enforce spatial reasoning. The left panel shows the original room layout, where an agent asks "What is on my right?" with the correct answer "white board." The right panels show SQA examples for the original viewpoint and the rotated viewpoints (90°, 180°, 270°), where the agent’s perspective shifts and the correct answers adjust accordingly while preserving spatial consistency.
  • Figure 4: Ablation Study on Training Strategies. Columns group results by model and dataset: LEO on ScanQA/Real-ScanQA (left) and SQA3D/Real-3DQA (center), and Chat-Scene on SQA3D/Real-3DQA (right). 3D-reweighted fine-tuning (3DR-FT) delivers consistent gains across both datasets and models, with the largest improvements on the 3D-dependent sets (Real-3DQA and Real-ScanQA), while Supervised FT remains strongest on SQA3D.
  • ...and 6 more figures
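
The filtering step in the Figure 3 caption and the viewpoint rotation in the Figure 4 caption can be illustrated with a minimal Python sketch. It is not the authors' implementation. The record type `QAResult`, its field names, and the helpers `is_3d_independent`, `filter_questions`, and `rotate_direction` are hypothetical names introduced here. The filter rule is one reading of the caption: a question is dropped if both $M_x$ and $M_x^{blind}$ answer it correctly, or if GPT answers it correctly without 3D input. The rotation helper assumes the agent turns clockwise in 90° steps and that only four egocentric directions are used.

```python
from dataclasses import dataclass

# Egocentric directions in clockwise order (an assumption of this sketch).
DIRECTIONS = ["front", "right", "back", "left"]


@dataclass(frozen=True)
class QAResult:
    """Hypothetical per-question record; field names are illustrative."""
    question_id: str
    correct_3d: bool     # the 3D-LLM M_x answered correctly
    correct_blind: bool  # the text-only M_x^blind answered correctly
    correct_gpt: bool    # GPT answered correctly without 3D input


def is_3d_independent(r: QAResult) -> bool:
    """Flag a question as 3D-independent under this sketch's reading of the
    caption: both M_x and M_x^blind are right, or GPT alone is right."""
    return (r.correct_3d and r.correct_blind) or r.correct_gpt


def filter_questions(results):
    """Return the ids of questions that are NOT flagged as 3D-independent."""
    return [r.question_id for r in results if not is_3d_independent(r)]


def rotate_direction(direction: str, quarter_turns_cw: int) -> str:
    """Egocentric direction of a fixed object after the agent turns
    clockwise by quarter_turns_cw * 90 degrees."""
    i = DIRECTIONS.index(direction)
    return DIRECTIONS[(i - quarter_turns_cw) % 4]
```

For instance, an object that starts on the agent's right ends up in front after a 90° clockwise turn, on the left after 180°, and behind after 270°. This is why the correct answer to "What is on my right?" has to change with the viewpoint even though the scene itself does not.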