Beyond Flatlands: Unlocking Spatial Intelligence by Decoupling 3D Reasoning from Numerical Regression
Zhongbin Guo, Jiahe Liu, Yushan Li, Wenyu Gao, Zhen Yang, Chenzhi Li, Xinyue Zhang, Ping Jian
TL;DR
GEODE addresses 3D spatial reasoning in Vision Language Models by decoupling 3D reasoning from numerical generation. The Decoupled Rationale Module (DRM) fuses explicit 3D data with 2D visual features and distills spatial CoT into injectable <Spatio> Rationale Tokens, while the Direct Regression Head (DRH) converts the embeddings of specialized control tokens into continuous outputs under an Embedding-as-Value paradigm. Two-stage training (DRM pretraining followed by joint DRH+VLM finetuning) yields a parameter-efficient 1.5B architecture that rivals 7B+ models on VSI-Bench, particularly on distance and 3D object localization tasks. The approach is relevant to embodied AI and robotics because it enables accurate 3D reasoning with modest computational resources, and it opens avenues for extending regression targets beyond scalars and 3D boxes.
Abstract
Existing Vision Language Models (VLMs), architecturally rooted in "flatland" perception, fundamentally struggle to comprehend real-world 3D spatial intelligence. This failure stems from a dual bottleneck: an input-stage conflict between computationally exorbitant geometry-aware encoders and superficial 2D-only features, and an output-stage misalignment in which discrete tokenizers are structurally incapable of producing precise, continuous numerical values. To break this impasse, we introduce GEODE (Geometric-Output and Decoupled-Input Engine), a novel architecture that resolves this dual bottleneck by decoupling 3D reasoning from numerical generation. GEODE augments the main VLM with two specialized, plug-and-play modules: a Decoupled Rationale Module (DRM) that acts as a spatial co-processor, aligning explicit 3D data with 2D visual features via cross-attention and distilling spatial Chain-of-Thought (CoT) logic into injectable Rationale Tokens; and a Direct Regression Head (DRH), an "Embedding-as-Value" paradigm that routes specialized control tokens to a lightweight MLP for precise, continuous regression of scalars and 3D bounding boxes. The synergy of these modules allows our 1.5B parameter model to function as a high-level semantic dispatcher, achieving state-of-the-art spatial reasoning performance that rivals 7B+ models.
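To make the "Embedding-as-Value" idea concrete, the following is a minimal sketch of how control-token embeddings might be routed to a lightweight MLP. It is an illustration under stated assumptions, not the GEODE implementation: the names `CONTROL_TOKEN_ID`, `RegressionHead`, and `route_control_tokens` are hypothetical, the layer sizes and ReLU activation are assumed, the weights are random and untrained, and plain Python lists stand in for the tensors a real model would use.

```python
import random

CONTROL_TOKEN_ID = 7  # hypothetical vocabulary id of a regression control token


def linear(x, weights, bias):
    """Dense layer: one output per weight row."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


class RegressionHead:
    """Two-layer MLP mapping a token embedding to continuous values (assumed design)."""

    def __init__(self, hidden_dim, mlp_dim, out_dim, seed=0):
        rng = random.Random(seed)

        def init(rows, cols):
            # Small random weights; a trained head would load learned parameters.
            return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

        self.w1, self.b1 = init(mlp_dim, hidden_dim), [0.0] * mlp_dim
        self.w2, self.b2 = init(out_dim, mlp_dim), [0.0] * out_dim

    def __call__(self, embedding):
        hidden = [max(0.0, v) for v in linear(embedding, self.w1, self.b1)]  # ReLU (assumed)
        return linear(hidden, self.w2, self.b2)


def route_control_tokens(token_ids, hidden_states, head):
    """Regress values only at control-token positions.

    Returns {position: [values]}; all other positions are left to ordinary
    discrete decoding, so the number is never spelled out digit by digit.
    """
    return {
        pos: head(hidden_states[pos])
        for pos, tok in enumerate(token_ids)
        if tok == CONTROL_TOKEN_ID
    }


# Toy sequence: two ordinary tokens and two control tokens, 8-dim hidden states.
rng = random.Random(1)
token_ids = [3, CONTROL_TOKEN_ID, 5, CONTROL_TOKEN_ID]
hidden_states = [[rng.gauss(0.0, 1.0) for _ in range(8)] for _ in token_ids]

scalar_head = RegressionHead(hidden_dim=8, mlp_dim=16, out_dim=1)  # e.g. a distance
outputs = route_control_tokens(token_ids, hidden_states, scalar_head)
```

In this sketch the language model only decides where a number belongs by emitting a control token, and the head reads the value from that token's embedding. A 3D bounding box would use a larger `out_dim`; the box parameterization is not specified here, so the sketch leaves it open.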
