Boosting MLLM Spatial Reasoning with Geometrically Referenced 3D Scene Representations

Jiangye Yuan, Gowri Kumar, Baoyuan Wang

TL;DR

This work presents a simple yet effective approach based on GR3D that requires no additional training and is readily applicable to different MLLMs. It boosts GPT-5's performance on VSI-Bench by 8% overall and by more than 11% on tasks that rely heavily on spatial layout understanding.

Abstract

While Multimodal Large Language Models (MLLMs) have achieved remarkable success in 2D visual understanding, their ability to reason about 3D space remains limited. To address this gap, we introduce geometrically referenced 3D scene representations (GR3D). Given a set of input images, GR3D annotates objects in the images with unique IDs and encodes their 3D geometric attributes as textual references indexed by these IDs. This representation enables MLLMs to interpret 3D cues using their advanced language-based skills in mathematical reasoning, while concurrently analyzing 2D visual features in a tightly coupled way. We present a simple yet effective approach based on GR3D, which requires no additional training and is readily applicable to different MLLMs. Implemented in a zero-shot setting, our approach boosts GPT-5's performance on VSI-Bench by 8% overall and more than 11% on tasks that rely heavily on spatial layout understanding. Qualitative studies further demonstrate that GR3D empowers MLLMs to perform complex spatial reasoning with highly sparse input views.
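
To make the idea of "textual references indexed by IDs" concrete, the following is a minimal sketch, not the authors' implementation. The abstract does not specify which geometric attributes are encoded or how they are formatted, so the `SceneObject` fields (`center`, `size`, in meters), the line format, and the function names are all assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class SceneObject:
    """One annotated object. All fields are hypothetical, for illustration only."""
    obj_id: int                          # unique ID, also drawn on the annotated image
    label: str                           # object category, e.g. "chair"
    center: tuple[float, float, float]   # assumed attribute: 3D center in meters
    size: tuple[float, float, float]     # assumed attribute: 3D extent in meters


def format_reference(obj: SceneObject) -> str:
    """Render one object's geometry as a single text line keyed by its ID."""
    cx, cy, cz = obj.center
    sx, sy, sz = obj.size
    return (
        f"[{obj.obj_id}] {obj.label}: "
        f"center=({cx:.2f}, {cy:.2f}, {cz:.2f}) m, "
        f"size=({sx:.2f}, {sy:.2f}, {sz:.2f}) m"
    )


def build_references(objects: list[SceneObject]) -> str:
    """Join per-object lines, sorted by ID, into one block of text for the prompt."""
    ordered = sorted(objects, key=lambda o: o.obj_id)
    return "\n".join(format_reference(o) for o in ordered)


if __name__ == "__main__":
    scene = [
        SceneObject(2, "chair", (1.0, 0.5, 2.0), (0.5, 0.9, 0.5)),
        SceneObject(1, "table", (0.0, 0.4, 2.5), (1.2, 0.8, 0.7)),
    ]
    print(build_references(scene))
```

Because each text line carries the same ID that marks the object in the image, a model could in principle look up an object it sees, read its coordinates, and compute quantities such as distances in text.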

Paper Structure (14 sections, 1 equation, 4 figures, 2 tables)

This paper contains 14 sections, 1 equation, 4 figures, 2 tables.

Figures (4)

  • Figure 1: An overview of the GR3D framework. Given a collection of images, our method reconstructs 3D scenes, extracts object-level geometric attributes, and transforms them into a GR3D representation, which consists of annotated images and textual references of geometric attributes. These paired texts and images are provided to an MLLM to perform spatial reasoning tasks.
  • Figure 2: Object annotation with occlusion check. Left: initial object annotation through projection. Middle: depth map from 3D reconstruction. Right: object annotation after depth-based occlusion check.
  • Figure 3: Prompt template for VSI-Bench evaluation.
  • Figure 4: Example results of sparse-view spatial reasoning. The part with incorrect reasoning is highlighted in red.
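
The depth-based occlusion check described in the Figure 2 caption can be sketched as follows. This is an illustrative guess at the general technique, not the paper's procedure: it assumes a pinhole camera with intrinsics `fx`, `fy`, `cx`, `cy`, a depth map stored as a list of rows, and a hypothetical tolerance `tol`. An object point is kept for annotation only if its projected depth does not exceed the reconstructed depth at that pixel by more than the tolerance.

```python
def project_point(point_cam, fx, fy, cx, cy):
    """Pinhole projection of a camera-frame point (x, y, z) to (u, v, depth).

    Returns None when the point lies behind the camera.
    """
    x, y, z = point_cam
    if z <= 0:
        return None
    return fx * x / z + cx, fy * y / z + cy, z


def is_visible(point_cam, depth_map, fx, fy, cx, cy, tol=0.1):
    """Return True if the point is not occluded according to the depth map.

    depth_map[row][col] holds the reconstructed depth at each pixel.
    tol (same unit as depth) is an assumed slack for reconstruction noise.
    """
    projected = project_point(point_cam, fx, fy, cx, cy)
    if projected is None:
        return False
    u, v, z = projected
    col, row = int(round(u)), int(round(v))
    # Points that project outside the image cannot be annotated in this view.
    if not (0 <= row < len(depth_map) and 0 <= col < len(depth_map[0])):
        return False
    # Occluded if some surface lies clearly in front of the point.
    return z <= depth_map[row][col] + tol


if __name__ == "__main__":
    # Toy 5x5 depth map: a flat surface 2.0 units from the camera.
    depth = [[2.0] * 5 for _ in range(5)]
    print(is_visible((0.0, 0.0, 1.9), depth, fx=2.0, fy=2.0, cx=2.0, cy=2.0))
    print(is_visible((0.0, 0.0, 3.0), depth, fx=2.0, fy=2.0, cx=2.0, cy=2.0))
```

In the toy example, the first point lies in front of the surface and passes the check, while the second lies behind it and is rejected as occluded.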