MoRE: 3D Visual Geometry Reconstruction Meets Mixture-of-Experts

Jingnan Gao, Zhe Wang, Xianze Fang, Xingyu Ren, Zhuo Chen, Shengqi Liu, Yuhao Cheng, Jiangjing Lyu, Xiaokang Yang, Yichao Yan

TL;DR

MoRE tackles scalability and robustness in 3D visual geometry reconstruction with a dense visual foundation model built on a Mixture-of-Experts architecture that routes features to specialized experts and feeds multiple task-specific prediction heads. It couples a confidence-based depth refinement module with dense semantic feature fusion to improve depth reliability and surface normal detail, and it is trained with tailored multi-task objectives. The authors report state-of-the-art performance across pointmap, monocular depth, camera pose, and surface normal benchmarks, with downstream applications supported without extra computation. The result is a scalable, adaptable backbone for 3D vision applications such as AR/VR, robotics, and autonomous systems.

Abstract

Recent advances in language and vision have demonstrated that scaling up model capacity consistently improves performance across diverse tasks. In 3D visual geometry reconstruction, large-scale training has likewise proven effective for learning versatile representations. However, further scaling of 3D models is challenging due to the complexity of geometric supervision and the diversity of 3D data. To overcome these limitations, we propose MoRE, a dense 3D visual foundation model based on a Mixture-of-Experts (MoE) architecture that dynamically routes features to task-specific experts, allowing them to specialize in complementary data aspects and enhance both scalability and adaptability. Aiming to improve robustness under real-world conditions, MoRE incorporates a confidence-based depth refinement module that stabilizes and refines geometric estimation. In addition, it integrates dense semantic features with globally aligned 3D backbone representations for high-fidelity surface normal prediction. MoRE is further optimized with tailored loss functions to ensure robust learning across diverse inputs and multiple geometric tasks. Extensive experiments demonstrate that MoRE achieves state-of-the-art performance across multiple benchmarks and supports effective downstream applications without extra computation.
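To make the routing idea in the abstract concrete, the following is a minimal sketch of top-k Mixture-of-Experts gating for a single feature vector. It is not the authors' implementation: the softmax gate, the top-k selection, the linear experts, and every name here (`moe_forward`, `gate_weights`, `expert_weights`, `top_k`) are assumptions chosen for illustration, since the abstract does not specify the router design.

```python
import math
import random


def softmax(scores):
    # Numerically stable softmax over a list of floats.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def linear(weights, x):
    # Matrix-vector product; weights is a list of rows.
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def moe_forward(x, gate_weights, expert_weights, top_k=2):
    """Route one feature vector to its top_k experts (hypothetical sketch)."""
    # The gate scores every expert for this input.
    gate_probs = softmax(linear(gate_weights, x))
    # Keep only the top_k highest-scoring experts (sparse routing).
    top = sorted(range(len(gate_probs)),
                 key=lambda i: gate_probs[i], reverse=True)[:top_k]
    # Renormalize the selected gate probabilities so they sum to 1.
    norm = sum(gate_probs[i] for i in top)
    out = [0.0] * len(expert_weights[0])
    for i in top:
        weight = gate_probs[i] / norm
        expert_out = linear(expert_weights[i], x)  # assumed linear expert
        out = [o + weight * e for o, e in zip(out, expert_out)]
    return out, top


random.seed(0)
dim, num_experts = 4, 3
gate_weights = [[random.gauss(0, 1) for _ in range(dim)]
                for _ in range(num_experts)]
expert_weights = [[[random.gauss(0, 1) for _ in range(dim)]
                   for _ in range(dim)]
                  for _ in range(num_experts)]

token = [0.5, -1.0, 0.25, 2.0]
output, chosen = moe_forward(token, gate_weights, expert_weights, top_k=2)
print("experts used:", chosen)
print("output:", output)
```

Because only `top_k` experts run per input, total parameter count can grow with the number of experts while per-input compute stays roughly fixed, which is the general motivation for sparse MoE layers.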

Paper Structure

This paper contains 19 sections, 14 equations, 11 figures, 7 tables.

Figures (11)

  • Figure 1: MoE sparks 3D visual geometry Reconstruction to do MoRE. MoRE is a feed-forward foundation model that leverages mixture-of-experts in 3D visual geometry reconstruction. MoRE takes unposed images as input and outputs a high-quality 3D pointmap, achieving robust geometric predictions across various scenarios.
  • Figure 2: Overview of MoRE. We propose MoRE, a dense visual foundation model featuring a mixture-of-experts architecture and multiple task-specific heads for geometric prediction. We adopt a two-stage training strategy. In Stage 1, we supervise the model with the multi-task training objectives. In Stage 2, we incorporate mixture-of-experts to further train the model for robust and accurate visual geometry reconstruction.
  • Figure 3: Real-world depth comparison. We present the ground-truth depth, prediction from MoGe, the confidence mask and our prediction after training with confidence-based depth refinement.
  • Figure 4: Qualitative comparison of multi-view 3D reconstruction. Our method demonstrates superior accuracy and robustness across diverse scenarios compared to previous feed-forward approaches.
  • Figure 5: Ablation for confidence-based depth refinement. We demonstrate the effectiveness of the confidence-based depth refinement for more accurate depth estimation.
  • ...and 6 more figures