Beyond Flatlands: Unlocking Spatial Intelligence by Decoupling 3D Reasoning from Numerical Regression

Zhongbin Guo, Jiahe Liu, Yushan Li, Wenyu Gao, Zhen Yang, Chenzhi Li, Xinyue Zhang, Ping Jian

TL;DR

GEODE tackles the core challenge of 3D spatial reasoning in Vision Language Models by decoupling 3D reasoning from numerical generation. The Decoupled Rationale Module (DRM) fuses explicit 3D data with 2D visual features and distills spatial CoT into injectable <Spatio> Rationale Tokens, while the Direct Regression Head (DRH) converts specialized control token embeddings into continuous outputs using an Embedding-as-Value paradigm. Two-stage training (DRM pretraining followed by joint DRH+VLM finetuning) yields a parameter-efficient 1.5B architecture that rivals 7B+ models on VSI-Bench, particularly on distance and 3D object localization tasks. This approach offers practical impact for embodied AI and robotics by enabling accurate 3D reasoning with modest computational resources, and it opens avenues for extending regression targets beyond scalars and 3D boxes.

Abstract

Existing Vision Language Models (VLMs), architecturally rooted in "flatland" perception, fundamentally struggle to comprehend real-world 3D spatial intelligence. This failure stems from a dual-bottleneck: an input-stage conflict between computationally exorbitant geometric-aware encoders and superficial 2D-only features, and an output-stage misalignment in which discrete tokenizers are structurally incapable of producing precise, continuous numerical values. To break this impasse, we introduce GEODE (Geometric-Output and Decoupled-Input Engine), a novel architecture that resolves this dual-bottleneck by decoupling 3D reasoning from numerical generation. GEODE augments the main VLM with two specialized, plug-and-play modules: the Decoupled Rationale Module (DRM), which acts as a spatial co-processor, aligning explicit 3D data with 2D visual features via cross-attention and distilling spatial Chain-of-Thought (CoT) logic into injectable Rationale Tokens; and the Direct Regression Head (DRH), an "Embedding-as-Value" paradigm that routes specialized control tokens to a lightweight MLP for precise, continuous regression of scalars and 3D bounding boxes. The synergy of these modules allows our 1.5B-parameter model to function as a high-level semantic dispatcher, achieving state-of-the-art spatial reasoning performance that rivals 7B+ models.
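
The "Embedding-as-Value" routing described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the token id `REG_TOKEN_ID`, the two-layer ReLU MLP, the layer sizes, and all function names are assumptions chosen for illustration, and plain Python lists stand in for the VLM's hidden states.

```python
import random

REG_TOKEN_ID = 7  # hypothetical id for the <REG> control token (assumption)


def make_mlp(in_dim, hidden_dim, out_dim, seed=0):
    """Randomly initialise a two-layer MLP stored as nested lists."""
    rng = random.Random(seed)
    w1 = [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)] for _ in range(hidden_dim)]
    b1 = [0.0] * hidden_dim
    w2 = [[rng.uniform(-0.1, 0.1) for _ in range(hidden_dim)] for _ in range(out_dim)]
    b2 = [0.0] * out_dim
    return w1, b1, w2, b2


def linear(weights, bias, x):
    """Compute weights @ x + bias for a single input vector."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def mlp_forward(params, x):
    """Linear -> ReLU -> Linear."""
    w1, b1, w2, b2 = params
    hidden = [max(0.0, h) for h in linear(w1, b1, x)]
    return linear(w2, b2, hidden)


def regress_control_tokens(token_ids, hidden_states, params):
    """Map each <REG> position to the continuous output regressed from its embedding.

    Positions holding ordinary text tokens are skipped; in a full model those
    would go through the usual vocabulary projection instead.
    """
    return {
        pos: mlp_forward(params, hidden_states[pos])
        for pos, tok in enumerate(token_ids)
        if tok == REG_TOKEN_ID
    }


# Toy usage: four tokens, one of which is the control token.
token_ids = [1, 4, REG_TOKEN_ID, 2]
hidden_states = [[0.1 * (i + j) for j in range(8)] for i in range(len(token_ids))]
scalar_head = make_mlp(in_dim=8, hidden_dim=16, out_dim=1)
outputs = regress_control_tokens(token_ids, hidden_states, scalar_head)
```

Setting `out_dim` larger than 1 would let the same head emit a fixed-length vector, for example the parameters of a 3D bounding box; the exact box parameterisation is not given in this summary, so none is assumed here.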

Paper Structure

This paper contains 20 sections, 7 equations, 4 figures, 4 tables.

Figures (4)

  • Figure 1: Performance comparison of GEODE with other mainstream VLMs on VSI-Bench (Yang et al., 2024). Our model, with only 1.5B parameters, achieves SOTA overall performance, especially on the Object Count and Absolute Distance tasks.
  • Figure 2: Overview of the GEODE architecture, contrasted with standard VLMs. (Top-Left) Standard VLMs are architecturally rooted in 2D perception, and their discrete tokenizers are ill-suited for generating precise, continuous numerical values. (Bottom-Left) The GEODE architecture resolves this dual-bottleneck by decoupling 3D reasoning from numerical generation. The Decoupled Rationale Module (DRM) acts as a spatial co-processor, fusing 2D visual features with 3D point-cloud data and distilling the logic into injectable <Spatio> Rationale Tokens to solve the input bottleneck. The Direct Regression Head (DRH) implements the "Embedding-as-Value" paradigm, intercepting <REG> control tokens and regressing their embeddings directly into continuous values to solve the output bottleneck. (Right) A Spatial Chain-of-Thought (CoT) sample used for training the DRM to generate Rationale Tokens that encapsulate the underlying reasoning process.
  • Figure 3: The two-stage training paradigm of GEODE. Stage 1: Reasoning Rationale Pretraining (DRM). The main LLM parameters are frozen. Only the DRM is trained, optimized via a Rationale-Guided Reconstruction loss ($\mathcal{L}_{DRM}$) to generate <Spatio> embeddings that the frozen LLM can autoregressively reconstruct into the corresponding textual reasoning rationale. Stage 2: Numerical Regression and Joint Finetuning (DRH). The pretrained DRM is frozen, and its <Spatio> tokens are injected as context. The VLM backbone and the newly initialized DRH are jointly finetuned using a mixed-loss objective: Cross-Entropy ($\mathcal{L}_{CE}$) for text generation and L2 regression ($\mathcal{L}_{DRH}$) for numerical outputs routed to the DRH.
  • Figure 4: Quantitative results for GEODE, an SFT-only baseline, and Qwen2.5-VL-3B (Bai et al., 2025).
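
The Stage 2 objective in the Figure 3 caption combines a cross-entropy term for text tokens with an L2 term for values routed to the DRH. A minimal sketch of such a mixed loss follows. The weighting factor `reg_weight`, the averaging over steps, and the function names are assumptions, since the caption does not state how the two terms are balanced.

```python
import math


def cross_entropy(logits, target_index):
    """Negative log-likelihood of the target token under softmax(logits)."""
    peak = max(logits)  # subtract the maximum for numerical stability
    log_norm = peak + math.log(sum(math.exp(v - peak) for v in logits))
    return log_norm - logits[target_index]


def l2_loss(predicted, target):
    """Mean squared error between a predicted and a target vector."""
    return sum((p - t) ** 2 for p, t in zip(predicted, target)) / len(target)


def mixed_loss(text_steps, regression_pairs, reg_weight=1.0):
    """Average CE over text positions plus weighted average L2 over <REG> positions.

    text_steps:       list of (logits, target_index) for ordinary text tokens
    regression_pairs: list of (predicted, target) vectors for <REG> tokens
    reg_weight:       hypothetical balancing coefficient (assumption)
    """
    ce = sum(cross_entropy(lg, t) for lg, t in text_steps) / len(text_steps)
    if regression_pairs:
        reg = sum(l2_loss(p, t) for p, t in regression_pairs) / len(regression_pairs)
    else:
        reg = 0.0  # a sample without numerical targets contributes text loss only
    return ce + reg_weight * reg


# Toy usage: two text positions and one scalar regression target.
loss = mixed_loss(
    text_steps=[([2.0, 0.5, -1.0], 0), ([0.1, 0.2, 0.3], 2)],
    regression_pairs=[([1.8], [2.0])],
    reg_weight=0.5,
)
```

Keeping the two terms separate means a sample with no numerical answer still trains the text pathway, while the regression term only receives gradient from positions that were routed to the head.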