GAP-MLLM: Geometry-Aligned Pre-training for Activating 3D Spatial Perception in Multimodal Large Language Models

Jiaxin Zhang, Junjun Jiang, Haijie Li, Youyu Chen, Kui Jiang, Dave Zhenyu Chen

Abstract

Multimodal Large Language Models (MLLMs) demonstrate exceptional semantic reasoning but struggle with 3D spatial perception when restricted to pure RGB inputs. Despite leveraging implicit geometric priors from 3D reconstruction models, image-based methods still exhibit a notable performance gap compared to methods using explicit 3D data. We argue that this gap does not arise from insufficient geometric priors, but from a misalignment in the training paradigm: text-dominated fine-tuning fails to activate geometric representations within MLLMs. Existing approaches typically resort to naive feature concatenation and optimize directly for downstream tasks without geometry-specific supervision, leading to suboptimal structural utilization. To address this limitation, we propose GAP-MLLM, a Geometry-Aligned Pre-training paradigm that explicitly activates structural perception before downstream adaptation. Specifically, we introduce a visual-prompted joint task that compels the MLLM to predict sparse pointmaps alongside semantic labels, thereby enforcing geometric awareness. Furthermore, we design a multi-level progressive fusion module with a token-level gating mechanism, enabling adaptive integration of geometric priors without suppressing semantic reasoning. Extensive experiments demonstrate that GAP-MLLM significantly enhances geometric feature fusion and consistently improves performance across 3D visual grounding, 3D dense captioning, and 3D video object detection tasks.
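
To make the idea of token-level gating concrete, the following is a minimal sketch and not the authors' implementation. It assumes one scalar gate per token, computed from the concatenated visual and geometric features, and a residual-style fusion; the function name `gated_fusion`, the parameters `w_gate` and `b_gate`, and the feature sizes are all hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_fusion(visual, geometric, w_gate, b_gate):
    """Fuse visual and geometric tokens with one scalar gate per token.

    visual, geometric: arrays of shape (num_tokens, dim)
    w_gate: array of shape (2 * dim,), b_gate: float (hypothetical gate parameters)
    Returns the fused tokens (num_tokens, dim) and the gates (num_tokens, 1).
    """
    gate_input = np.concatenate([visual, geometric], axis=-1)  # (N, 2D)
    gate = sigmoid(gate_input @ w_gate + b_gate)[:, None]      # (N, 1), values in (0, 1)
    # Assumed residual form: a gate near 0 leaves the visual token untouched,
    # so semantic features are not overwritten by geometric ones.
    fused = visual + gate * geometric
    return fused, gate

rng = np.random.default_rng(0)
visual = rng.normal(size=(4, 8))      # 4 tokens, 8-dim features (toy sizes)
geometric = rng.normal(size=(4, 8))
w_gate = rng.normal(size=16) * 0.1
fused, gate = gated_fusion(visual, geometric, w_gate, 0.0)
```

In this form a closed gate reduces the fused token to the visual token, which is one simple way to integrate geometric priors per token without suppressing semantics.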

Paper Structure (41 sections, 3 equations, 13 figures, 14 tables)

Figures (13)

  • Figure 1: Geometry-aligned pre-training significantly improves 3D perception in MLLMs. (A) Naive fusion without geometry-aware pre-training leads to limited geometric utilization and inaccurate 3D detection. (B) Our sparse geometry–semantics joint pre-training and multi-level fusion module progressively integrate geometric priors, activating structural perception and yielding substantial gains in 3D detection mAP.
  • Figure 2: Geometry–semantics imbalance in existing 3D perception paradigms. Point cloud-based methods ensure geometric accuracy but lack semantics, whereas image-based geometric encoders (GE) with VLMs retain semantics yet under-utilize geometry. As a result, both paradigms exhibit suboptimal performance in 3D perception.
  • Figure 3: Failure Analysis. Global attention maps of the geometric encoder are visualized using three representative tokens across layers. The same token exhibits distinct attention patterns at different layers. A prior method that relies on last-layer tokens as implicit geometric priors [zheng2025learning] leads to inaccurate 3D bounding box prediction.
  • Figure 4: Network Architecture. An image sequence is processed by parallel geometric and visual branches to extract multi-level structural and semantic tokens. After gated multi-level fusion, the final-layer fused tokens are aligned with task-related textual representations in the video LLM decoder, while selected intermediate-layer tokens are injected into early decoder blocks to preserve hierarchical geometric information.
  • Figure 5: Training Strategy. Sparse joint pre-training (left) activates 3D representations under a unified first-frame metric coordinate system. The learned representations are then transferred to object-level fine-tuning (right) for downstream 3D perception.
  • ...and 8 more figures
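
The caption of Figure 5 refers to a unified first-frame coordinate system and to sparse supervision. The sketch below illustrates both ideas under stated assumptions: camera poses are given as 4x4 camera-to-world matrices, and sparse targets are drawn by uniform random pixel sampling. The source does not state how poses are represented or how points are sampled, so `to_first_frame` and `sample_sparse` are hypothetical helpers, not the paper's procedure.

```python
import numpy as np

def to_first_frame(points_cam, cam_to_world, first_cam_to_world):
    """Map (N, 3) points from camera i's frame into the first camera's frame."""
    # Relative transform: camera i -> world -> first camera.
    rel = np.linalg.inv(first_cam_to_world) @ cam_to_world
    ones = np.ones((len(points_cam), 1))
    homogeneous = np.concatenate([points_cam, ones], axis=1)  # (N, 4)
    return (homogeneous @ rel.T)[:, :3]

def sample_sparse(pointmap, num_samples, rng):
    """Pick num_samples pixels from an (H, W, 3) pointmap.

    Uniform sampling without replacement is an assumption made for illustration.
    Returns pixel coordinates (num_samples, 2) and 3D points (num_samples, 3).
    """
    height, width, _ = pointmap.shape
    flat_idx = rng.choice(height * width, size=num_samples, replace=False)
    rows, cols = np.divmod(flat_idx, width)
    return np.stack([rows, cols], axis=1), pointmap[rows, cols]

# Toy example: the second camera is shifted 1 unit along x from the first.
first_pose = np.eye(4)
second_pose = np.eye(4)
second_pose[0, 3] = 1.0
origin_in_first = to_first_frame(np.zeros((1, 3)), second_pose, first_pose)

rng = np.random.default_rng(0)
dense = rng.normal(size=(4, 5, 3))           # toy 4x5 pointmap
pixels, sparse_points = sample_sparse(dense, 6, rng)
```

Expressing every frame's points relative to the first camera gives all targets one shared reference frame, so a point predicted from a later frame is directly comparable to one from an earlier frame.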