Matrix3D: Large Photogrammetry Model All-in-One

Yuanxun Lu, Jingyang Zhang, Tian Fang, Jean-Daniel Nahmias, Yanghai Tsin, Long Quan, Xun Cao, Yao Yao, Shiwei Li

TL;DR

Matrix3D presents a unified diffusion-transformer model for photogrammetry that jointly handles pose estimation, depth prediction, and novel view synthesis. By employing masked learning and multi-modal fusion across RGB, camera geometry via Plücker ray maps, and depth, it enables flexible input/output configurations and trains on partially labeled data. The approach achieves state-of-the-art results in pose estimation and novel view synthesis, and demonstrates competitive mono- and multi-view depth, as well as 3D reconstruction capabilities, with the added benefit of single- and few-shot generation. This all-in-one model simplifies the photogrammetry pipeline while providing rich interactive control for 3D content creation, and highlights practical potential for real-world reconstruction tasks where data is sparse or partially labeled.
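To make the "Plücker ray maps" mentioned above concrete, here is a minimal NumPy sketch of one common way to encode a pinhole camera as a per-pixel 6-channel map: a unit ray direction `d` plus the moment `o × d`, where `o` is the camera center. The function name `plucker_ray_map`, the channel order, the half-pixel offset, and the camera-to-world convention are all assumptions made for illustration; the summary does not specify the convention Matrix3D uses.

```python
import numpy as np

def plucker_ray_map(K, R_c2w, cam_center, height, width):
    """Illustrative Plücker ray map of shape (H, W, 6): [direction, moment].

    K          : (3, 3) pinhole intrinsics (assumed convention).
    R_c2w      : (3, 3) camera-to-world rotation (assumed convention).
    cam_center : (3,) camera center in world coordinates.
    """
    # Pixel centers in homogeneous image coordinates, shape (H, W, 3).
    u, v = np.meshgrid(np.arange(width) + 0.5, np.arange(height) + 0.5)
    pix = np.stack([u, v, np.ones_like(u)], axis=-1)

    # Back-project to camera-frame directions, then rotate into the world frame.
    dirs_cam = pix @ np.linalg.inv(K).T
    dirs = dirs_cam @ R_c2w.T
    dirs = dirs / np.linalg.norm(dirs, axis=-1, keepdims=True)

    # Plücker moment m = o x d; it is orthogonal to d by construction.
    origins = np.broadcast_to(np.asarray(cam_center, dtype=float), dirs.shape)
    moments = np.cross(origins, dirs)
    return np.concatenate([dirs, moments], axis=-1)

# Hypothetical camera used only to exercise the function.
K = np.array([[100.0, 0.0, 32.0],
              [0.0, 100.0, 24.0],
              [0.0, 0.0, 1.0]])
rays = plucker_ray_map(K, np.eye(3), np.array([1.0, 2.0, 3.0]), 48, 64)
```

Because the result has the same spatial layout as an image, a map like this can be treated as one more dense modality alongside RGB and depth, which fits the multi-modal fusion described above.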

Abstract

We present Matrix3D, a unified model that performs several photogrammetry subtasks, including pose estimation, depth prediction, and novel view synthesis, with a single model. Matrix3D utilizes a multi-modal diffusion transformer (DiT) to integrate transformations across several modalities, such as images, camera parameters, and depth maps. The key to Matrix3D's large-scale multi-modal training lies in the incorporation of a mask learning strategy. This enables full-modality model training even with partially complete data, such as bi-modality data of image-pose and image-depth pairs, and thus significantly increases the pool of available training data. Matrix3D demonstrates state-of-the-art performance in pose estimation and novel view synthesis tasks. Additionally, it offers fine-grained control through multi-round interactions, making it an innovative tool for 3D content creation. Project page: https://nju-3dv.github.io/projects/matrix3d.
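The mask learning strategy can be illustrated with a small NumPy sketch, under stated assumptions. Each (view, modality) slot that has ground truth is randomly assigned to be either a clean observation or a noise-corrupted denoising target, while slots with no ground truth (for example, depth in an image-pose pair) are assigned to neither and so never contribute to the loss. The helper names `sample_masks` and `corrupt`, the 50% target probability, and the linear noise blend are hypothetical choices for illustration, not the paper's actual schedule or code.

```python
import numpy as np

MODALITIES = ("rgb", "ray", "depth")  # assumed modality order

def sample_masks(available, rng, p_target=0.5):
    """Split available (view, modality) slots into observed vs. target.

    available : bool array (V, M), True where ground truth exists.
    Returns (observed, target); unavailable slots are False in both.
    """
    target = available & (rng.random(available.shape) < p_target)
    observed = available & ~target
    return observed, target

def corrupt(maps, target, noise_level, rng):
    """Blend Gaussian noise into target slots; leave other slots untouched.

    maps : dict modality -> float array of shape (V, H, W, C).
    The linear blend below is a stand-in for a real diffusion schedule.
    """
    noisy = {}
    for m, name in enumerate(MODALITIES):
        x = maps[name]
        noise = rng.standard_normal(x.shape)
        mask = target[:, m].reshape(-1, 1, 1, 1)
        noisy[name] = np.where(mask, (1.0 - noise_level) * x + noise_level * noise, x)
    return noisy

# Hypothetical image-pose sample: 4 views, depth ground truth missing.
rng = np.random.default_rng(0)
available = np.array([[True, True, False]] * 4)
maps = {"rgb": np.ones((4, 8, 8, 3)),
        "ray": np.ones((4, 8, 8, 6)),
        "depth": np.zeros((4, 8, 8, 1))}  # placeholder, never supervised
observed, target = sample_masks(available, rng)
noisy = corrupt(maps, target, noise_level=0.7, rng=rng)
```

In a setup like this, a training loss would be averaged only over `target` slots, which is how image-pose and image-depth pairs could both supervise one full-modality model.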

Paper Structure

This paper contains 26 sections, 2 equations, 10 figures, 10 tables.

Figures (10)

  • Figure 1: Utilizing Matrix3D for single/few-shot reconstruction. Before 3DGS optimization, we complete the input set by pose estimation, depth estimation and novel view synthesis, all of which are done by the same model.
  • Figure 2: We train Matrix3D by masked learning. Multi-modal data are randomly masked by noise corruption. Observations (green) and noisy maps (yellow) are fed into the encoder and the decoder respectively. By attaching the view and modality information to the clean and noisy inputs via different positional encodings, the model learns to denoise the corrupted maps and generate the desired outputs.
  • Figure 3: Sparse-view pose estimation results on the CO3D dataset. The black axes are the ground truth and the colored ones are the estimates.
  • Figure 4: Qualitative evaluation results of novel view synthesis from single images on the GSO and ARKitScenes datasets: a) random novel views; b) and c) follow the view configurations of SyncDreamer and Wonder3D respectively; d) indoor scenes from the ARKitScenes dataset. Note that our method supports NVS of arbitrary poses.
  • Figure 5: Monocular 3D reconstruction. Additional novel view renderings of our method are shown in the last two columns.
  • ...and 5 more figures