WorldMirror: Universal 3D World Reconstruction with Any-Prior Prompting

Yifan Liu, Zhiyuan Min, Zhenwei Wang, Junta Wu, Tengfei Wang, Yixuan Yuan, Yawei Luo, Chunchao Guo

TL;DR

WorldMirror addresses the limitations of task-specific 3D methods with a universal, prior-aware feed-forward model for 3D reconstruction. It uses Multi-Modal Prior Prompting to fuse camera intrinsics, poses, and depth with image tokens, and a Universal Geometric Prediction head to output point maps, depths, camera parameters, normals, and 3D Gaussians; the model is trained with curriculum learning and dynamic prior injection. Across diverse benchmarks including 7-Scenes, DTU, RealEstate10K, ScanNet, and VR-NeRF, it achieves state-of-the-art results while maintaining efficient inference. Ablations show the benefits of compact single-token priors and the necessity of 3DGS supervision and gradient-consistency losses. The framework advances unified 3D scene understanding and offers practical gains for multi-task geometry estimation and novel-view synthesis.

Abstract

We present WorldMirror, an all-in-one, feed-forward model for versatile 3D geometric prediction tasks. Unlike existing methods constrained to image-only inputs or customized for a specific task, our framework flexibly integrates diverse geometric priors, including camera poses, intrinsics, and depth maps, while simultaneously generating multiple 3D representations: dense point clouds, multi-view depth maps, camera parameters, surface normals, and 3D Gaussians. This elegant and unified architecture leverages available prior information to resolve structural ambiguities and delivers geometrically consistent 3D outputs in a single forward pass. WorldMirror achieves state-of-the-art performance across diverse benchmarks from camera, point map, depth, and surface normal estimation to novel view synthesis, while maintaining the efficiency of feed-forward inference. Code and models will be publicly available soon.

Paper Structure

This paper contains 21 sections, 11 equations, 11 figures, 8 tables.

Figures (11)

  • Figure 1: WorldMirror is a large feed-forward 3D reconstruction model that takes raw images along with optional priors (depth, calibrated intrinsics, camera pose) as input and produces high-quality geometric attributes in seconds, including point clouds, 3DGS, cameras, depth, and normal maps.
  • Figure 2: Overview of WorldMirror. Given multi-view images with optional priors (depths, calibrated intrinsics, camera poses) as input, our framework encodes each prior modality into tokens and integrates them with image tokens. The composite tokens are subsequently processed by a visual transformer backbone to effectively aggregate multi-view features. The consolidated representations are then passed to multi-task heads to generate comprehensive geometric outputs, including point maps, camera parameters, multi-view depth maps, surface normals, and 3D Gaussians.
  • Figure 3: Feed-Forward 3D Gaussians Predicted by WorldMirror with In-The-Wild Inputs. Besides real photos, our method generalizes well to AI-created videos spanning diverse styles.
  • Figure 4: Qualitative Comparisons of Novel View Synthesis. We compare with FLARE and AnySplat on RealEstate10K and DL3DV. The first four columns correspond to the sparse-view setting, while the latter three correspond to the dense-view setting. Our approach surpasses baselines in both appearance fidelity and geometric perception.
  • Figure 5: Geometric Priors Unlock Enhanced Scene Reconstruction of WorldMirror. (Top) Camera poses help the model capture relative view positions accurately. (Middle) Calibrated intrinsics enhance the reconstruction by enabling precise projection modeling and geometry alignment. (Bottom) Depth guidance enables the network to better handle challenging reconstruction scenarios, such as perspective distortion, unusual geometric configurations, or partial occlusions.
  • ...and 6 more figures
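
The prior-prompting step summarized in the TL;DR and in the Figure 2 caption (each optional prior is encoded into tokens and combined with image tokens) can be illustrated with a small NumPy sketch. This is not the authors' implementation: the token width `D`, the patch size `PATCH`, the random projections standing in for learned layers, the 3x4 pose and `(fx, fy, cx, cy)` intrinsics encodings, and the helper names `patchify`, `build_tokens`, and `sample_prior_subset` are all assumptions made for illustration. Pose and intrinsics are each compressed into a single token, in the spirit of the "compact single-token priors" mentioned above, while depth is treated as a dense map added to the image tokens.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 64       # token width (hypothetical)
PATCH = 8    # patch size in pixels (hypothetical)

def linear(in_dim, out_dim):
    """Random projection standing in for a learned linear layer."""
    return rng.standard_normal((in_dim, out_dim)) / np.sqrt(in_dim)

W_POSE = linear(12, D)              # flattened 3x4 [R|t] matrix (assumed encoding)
W_INTR = linear(4, D)               # fx, fy, cx, cy (assumed encoding)
W_DEPTH = linear(PATCH * PATCH, D)  # one depth patch -> one token

def patchify(x, patch=PATCH):
    """Split an (H, W) map into (num_patches, patch * patch) rows."""
    h, w = x.shape
    x = x.reshape(h // patch, patch, w // patch, patch)
    return x.transpose(0, 2, 1, 3).reshape(-1, patch * patch)

def build_tokens(image_tokens, pose=None, intrinsics=None, depth=None):
    """Combine image tokens with whichever priors are available."""
    tokens = image_tokens.copy()
    if depth is not None:
        # Dense prior: add per-patch depth tokens to the image tokens.
        tokens = tokens + patchify(depth) @ W_DEPTH
    extra = []
    if pose is not None:
        extra.append(pose.reshape(-1) @ W_POSE)   # single pose token
    if intrinsics is not None:
        extra.append(intrinsics @ W_INTR)         # single intrinsics token
    if extra:
        # Compact priors: prepend one token per available prior.
        tokens = np.concatenate([np.stack(extra), tokens], axis=0)
    return tokens

def sample_prior_subset(priors, keep_prob=0.5):
    """Randomly withhold priors, a guess at what dynamic prior injection does."""
    return {k: v for k, v in priors.items() if rng.random() < keep_prob}

# Usage: one 32x32 view, i.e. 16 patches of 8x8 pixels.
image_tokens = rng.standard_normal((16, D))
priors = {
    "pose": np.eye(4)[:3],                             # identity [R|t]
    "intrinsics": np.array([30.0, 30.0, 16.0, 16.0]),
    "depth": np.ones((32, 32)),
}
all_tokens = build_tokens(image_tokens, **priors)      # shape (18, 64)
no_prior_tokens = build_tokens(image_tokens)           # shape (16, 64)
train_tokens = build_tokens(image_tokens, **sample_prior_subset(priors))
```

Because every prior is optional, the same function serves image-only inputs and any combination of priors; only the number of prepended tokens changes, which is the property a backbone would need in order to accept "any-prior" input.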