3D-IDE: 3D Implicit Depth Emergent

Chushan Zhang, Ruihan Lu, Jinguang Tong, Yikai Wang, Hongdong Li

Abstract

Leveraging 3D information within Multimodal Large Language Models (MLLMs) has recently shown significant advantages for indoor scene understanding. However, existing methods, including those using explicit ground-truth 3D positional encoding and those grafting external 3D foundation models for implicit geometry, struggle with the trade-off inherent in 2D-3D representation fusion, leading to suboptimal deployment. To this end, we propose 3D-Implicit Depth Emergence, a method that reframes 3D perception as an emergent property derived from geometric self-supervision rather than explicit encoding. Our core insight is the Implicit Geometric Emergence Principle: by strategically leveraging privileged geometric supervision through mechanisms such as a fine-grained geometry validator and global representation constraints, we construct an information bottleneck. This bottleneck forces the model to maximize the mutual information between visual features and 3D structures, allowing 3D awareness to emerge naturally within a unified visual representation. Unlike existing approaches, our method enables 3D perception to emerge implicitly, disentangling features in dense regions and, crucially, eliminating depth and pose dependencies during inference with zero latency overhead. This paradigm shift from external grafting to implicit emergence represents a fundamental rethinking of 3D knowledge integration in vision-language models. Extensive experiments demonstrate that our method surpasses state-of-the-art methods on multiple 3D scene understanding benchmarks. Our approach achieves a 55% reduction in inference latency while maintaining strong performance across diverse downstream tasks, underscoring the effectiveness of carefully designed auxiliary objectives for dependency-free 3D understanding. Source code can be found at github.com/ChushanZhang/3D-IDE.
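
To make the training objective concrete, the sketch below shows one plausible way such privileged geometric supervision could be attached to the shared visual tokens: an auxiliary loss from a lightweight depth validator and a global alignment loss against a frozen 3D-aware foundation model are added to the usual language-modeling loss during training only, then dropped at inference. This is a minimal illustration assuming a PyTorch implementation; the module name PrivilegedGeometricSupervision, the L1/cosine loss choices, and the lambda_geo / lambda_global weights are assumptions for illustration, not the paper's actual code.

```python
import torch.nn as nn
import torch.nn.functional as F


class PrivilegedGeometricSupervision(nn.Module):
    """Training-only auxiliary heads (hypothetical design).

    Both heads read the same visual tokens produced by the single encoder;
    neither is executed at inference, so RGB-only deployment keeps zero
    latency overhead.
    """

    def __init__(self, dim: int, depth_patch: int = 14):
        super().__init__()
        # Fine-grained geometry validator: predicts a coarse per-patch depth map.
        self.depth_head = nn.Linear(dim, depth_patch * depth_patch)
        # Global representation constraint: projects pooled tokens toward the
        # embedding space of a frozen 3D-aware foundation model.
        self.global_proj = nn.Linear(dim, dim)

    def forward(self, vis_tokens, gt_patch_depth, frozen_global_emb):
        # vis_tokens:        (B, N, dim) tokens from the shared visual encoder
        # gt_patch_depth:    (B, N, depth_patch**2) privileged depth targets
        # frozen_global_emb: (B, dim) target embedding from the frozen 3D model
        geo_loss = F.l1_loss(self.depth_head(vis_tokens), gt_patch_depth)
        pooled = self.global_proj(vis_tokens.mean(dim=1))
        global_loss = 1.0 - F.cosine_similarity(pooled, frozen_global_emb, dim=-1).mean()
        return geo_loss, global_loss


def training_loss(lm_loss, geo_loss, global_loss, lambda_geo=0.5, lambda_global=0.1):
    # Hypothetical weighting: the auxiliary terms are simply added to the
    # language-modeling loss; at inference only the LM path remains.
    return lm_loss + lambda_geo * geo_loss + lambda_global * global_loss
```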

Paper Structure

This paper contains 31 sections, 13 equations, 7 figures, 15 tables.

Figures (7)

  • Figure 1: Comparison of 3D-aware designs for video-LLMs. (a) Explicit coordinate injection fuses 2D features with coarse 3D positional embeddings and requires 3D inputs at inference. (b) Dual encoders separately process RGB and geometry, then fuse their outputs, increasing complexity and latency. (c) 3D-IDE uses a single visual encoder trained so that 3D awareness emerges implicitly, enabling efficient RGB-only inference.
  • Figure 2: Illustration of the double information loss in explicit coordinate injection. (a) RGB frame with a 2D patch whose pixels are back-projected to a point cloud. (b) Pooling collapses all patch points into one token, losing local structure. (c) Voxelization merges distinct 3D points into the same voxel, further degrading fine-grained geometry and harming downstream 3D reasoning.
  • Figure 3: The 3D-IDE framework. Our approach avoids the "Double Information Loss" (see Figure 2) inherent in explicit coordinate injection methods. Instead of injecting coarse, lossy coordinates, we use a privileged training module (green box) that is detached at inference for zero latency (an RGB-only inference sketch follows this figure list). This module forces the model to learn a fine-grained 3D representation implicitly via two parallel gradient signals (green arrows): a geometric gradient from a weak depth validator and a global gradient from the guidance of a frozen foundation model.
  • Figure 4: Qualitative results on three 3D vision-language tasks.
  • Figure 5: More qualitative results on three 3D vision-language tasks: language-guided object localization (top), region-level captioning (middle), and spatial question answering (bottom). In the grounding examples, green 3D bounding boxes denote the ground-truth targets, red boxes the predictions of the baseline, and blue boxes the predictions of our model. Our method better aligns with the targets and produces more accurate captions and answers.
  • ...and 2 more figures
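
As referenced in the Figure 3 caption, the deployment-time counterpart is simply the model without the privileged module. The sketch below (hypothetical wrapper and argument names, same assumed PyTorch setting as above) illustrates the RGB-only inference path: only the single visual encoder and the LLM are kept, so no depth, pose, or point-cloud inputs are needed at test time.

```python
import torch
import torch.nn as nn


class RGBOnlyInference(nn.Module):
    """Hypothetical deployment wrapper: the privileged geometric heads used
    during training are discarded, leaving only the visual encoder (whose
    weights now carry the emergent 3D awareness) and the LLM."""

    def __init__(self, visual_encoder: nn.Module, llm: nn.Module):
        super().__init__()
        self.visual_encoder = visual_encoder
        self.llm = llm

    @torch.no_grad()
    def forward(self, rgb_frames, text_tokens):
        # rgb_frames: (B, T, 3, H, W) video frames; no depth, pose, or point
        # cloud inputs are required.
        vis_tokens = self.visual_encoder(rgb_frames)
        return self.llm(vis_tokens, text_tokens)
```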