Matrix3D: Large Photogrammetry Model All-in-One
Yuanxun Lu, Jingyang Zhang, Tian Fang, Jean-Daniel Nahmias, Yanghai Tsin, Long Quan, Xun Cao, Yao Yao, Shiwei Li
TL;DR
Matrix3D is a unified diffusion-transformer model for photogrammetry that jointly handles pose estimation, depth prediction, and novel view synthesis. By combining masked learning with multi-modal fusion across RGB images, camera geometry (encoded as Plücker ray maps), and depth, it supports flexible input/output configurations and can be trained on partially labeled data. The approach achieves state-of-the-art results in pose estimation and novel view synthesis, and shows competitive monocular and multi-view depth prediction and 3D reconstruction, with the added benefit of single- and few-shot generation. This all-in-one model simplifies the photogrammetry pipeline, provides interactive control for 3D content creation, and shows practical potential for real-world reconstruction tasks where data is sparse or partially labeled.
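To make the Plücker ray map encoding concrete, here is a minimal sketch of how a pinhole camera can be turned into a per-pixel 6-channel map of ray direction and moment. It is an illustration under stated assumptions, not the authors' implementation: the function names, the world-to-camera convention x_cam = R x_world + t, the half-pixel offset, and the (direction, moment) channel order are all assumed here.

```python
import math


def plucker_ray(fx, fy, cx, cy, R, t, u, v):
    """Return [dx, dy, dz, mx, my, mz] for the ray through pixel (u, v).

    Assumes a pinhole camera with world-to-camera extrinsics
    x_cam = R @ x_world + t (R: 3x3 nested list, t: length-3 list).
    """
    # Camera centre in world coordinates: o = -R^T t.
    o = [-sum(R[k][i] * t[k] for k in range(3)) for i in range(3)]
    # Back-project the pixel to a camera-space direction, K^-1 [u, v, 1].
    d_cam = [(u - cx) / fx, (v - cy) / fy, 1.0]
    # Rotate into world space (d = R^T d_cam) and normalise to unit length.
    d = [sum(R[k][i] * d_cam[k] for k in range(3)) for i in range(3)]
    norm = math.sqrt(sum(x * x for x in d))
    d = [x / norm for x in d]
    # Moment m = o x d; it is orthogonal to d by construction.
    m = [
        o[1] * d[2] - o[2] * d[1],
        o[2] * d[0] - o[0] * d[2],
        o[0] * d[1] - o[1] * d[0],
    ]
    return d + m


def plucker_ray_map(fx, fy, cx, cy, R, t, height, width):
    """Build a height x width grid of 6-vectors, one ray per pixel centre."""
    return [
        [plucker_ray(fx, fy, cx, cy, R, t, u + 0.5, v + 0.5) for u in range(width)]
        for v in range(height)
    ]
```

Because the map has the same spatial layout as the image, camera geometry can be handled as one more dense, image-like modality alongside RGB and depth.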
Abstract
We present Matrix3D, a unified model that performs several photogrammetry subtasks, including pose estimation, depth prediction, and novel view synthesis, all with a single model. Matrix3D utilizes a multi-modal diffusion transformer (DiT) to integrate transformations across several modalities, such as images, camera parameters, and depth maps. The key to Matrix3D's large-scale multi-modal training lies in the incorporation of a mask learning strategy. This enables full-modality model training even with partially complete data, such as bi-modality data of image-pose and image-depth pairs, thus significantly increasing the pool of available training data. Matrix3D demonstrates state-of-the-art performance in pose estimation and novel view synthesis tasks. Additionally, it offers fine-grained control through multi-round interactions, making it an innovative tool for 3D content creation. Project page: https://nju-3dv.github.io/projects/matrix3d.
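The mask learning idea can be illustrated with a small sketch of how each modality of each view might be assigned a role during training. This is a hypothetical illustration only: the role names, the `p_condition` probability, and the rule that at least one modality must be a target are assumptions made for the example, and the abstract does not specify the actual masking scheme.

```python
import random

MODALITIES = ("rgb", "pose", "depth")


def sample_roles(available, rng, p_condition=0.5):
    """Assign a role to every modality of every view.

    available: list of dicts, one per view, mapping modality -> bool
               (True if the dataset provides that modality for the view).
    Roles (hypothetical names):
      "condition" - given to the model as a clean input
      "target"    - masked out and predicted; contributes to the loss
      "missing"   - no label available; masked out and excluded from the loss
    """
    roles = []
    for view in available:
        view_roles = {}
        for modality in MODALITIES:
            if not view.get(modality, False):
                view_roles[modality] = "missing"
            elif rng.random() < p_condition:
                view_roles[modality] = "condition"
            else:
                view_roles[modality] = "target"
        roles.append(view_roles)

    # Make sure there is something to predict (an assumption of this sketch).
    if not any(r == "target" for v in roles for r in v.values()):
        candidates = [
            (i, m) for i, v in enumerate(roles) for m, r in v.items() if r == "condition"
        ]
        if candidates:
            i, m = rng.choice(candidates)
            roles[i][m] = "target"
    return roles


# Example: one view from an image-pose dataset, one from an image-depth dataset.
example = sample_roles(
    [{"rgb": True, "pose": True}, {"rgb": True, "depth": True}],
    random.Random(0),
)
```

Under this scheme, bi-modality samples such as image-pose or image-depth pairs can be mixed in the same training run, because unavailable modalities are simply treated as masked and never supervised.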
