UVRM: A Scalable 3D Reconstruction Model from Unposed Videos

Shiu-hong Kao; Xiao Li; Jinglu Wang; Yang Li; Chi-Keung Tang; Yu-Wing Tai; Yan Lu

TL;DR

UVRM addresses the problem of training 3D reconstruction models from unposed 2D videos by learning a pose-invariant latent representation via a transformer and decoding it into a tri-plane 3D representation. The training framework combines Score Distillation Sampling with an analysis-by-synthesis, diffusion-based augmentation that synthesizes pseudo-views without pose annotations. Evaluations on G-Objaverse and CO3D demonstrate robust reconstruction for diverse objects and real-world videos, outperforming pose-free NeRF baselines in several metrics. This work advances scalable 3D foundation-model development by eliminating pose-label requirements and leveraging diffusion priors for view-consistent 3D reconstruction.

Abstract

Large Reconstruction Models (LRMs) have recently become a popular method for creating 3D foundational models. Training 3D reconstruction models with 2D visual data traditionally requires prior knowledge of camera poses for the training samples, a process that is both time-consuming and prone to errors. Consequently, 3D reconstruction training has been confined to either synthetic 3D datasets or small-scale datasets with annotated poses. In this study, we investigate the feasibility of 3D reconstruction using unposed video data of various objects. We introduce UVRM, a novel 3D reconstruction model capable of being trained and evaluated on monocular videos without requiring any information about the pose. UVRM uses a transformer network to implicitly aggregate video frames into a pose-invariant latent feature space, which is then decoded into a tri-plane 3D representation. To obviate the need for ground-truth pose annotations during training, UVRM employs a combination of the score distillation sampling (SDS) method and an analysis-by-synthesis approach, progressively synthesizing pseudo novel-views using a pre-trained diffusion model. We qualitatively and quantitatively evaluate UVRM's performance on the G-Objaverse and CO3D datasets without relying on pose information. Extensive experiments show that UVRM is capable of effectively and efficiently reconstructing a wide range of 3D objects from unposed videos.
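To make the tri-plane output concrete, the following is a minimal sketch of how a feature could be read from a tri-plane at a 3D point. It is not taken from the paper: the plane order (XY, XZ, YZ), the summation of the three sampled features, the `[-1, 1]` coordinate range, and all function names are assumptions chosen for illustration. Other tri-plane implementations concatenate the three features instead of summing them.

```python
from typing import List, Sequence

Plane = List[List[List[float]]]  # feature grid indexed as [row][col][channel]


def bilinear_sample(plane: Plane, u: float, v: float) -> List[float]:
    """Bilinearly sample a [H][W][C] grid at (u, v), each clamped to [-1, 1]."""
    h, w = len(plane), len(plane[0])
    # Map normalized coordinates to continuous pixel indices.
    x = (min(max(u, -1.0), 1.0) + 1.0) * 0.5 * (w - 1)
    y = (min(max(v, -1.0), 1.0) + 1.0) * 0.5 * (h - 1)
    x0, y0 = int(x), int(y)
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    fx, fy = x - x0, y - y0
    channels = len(plane[0][0])
    return [
        (1 - fy) * ((1 - fx) * plane[y0][x0][c] + fx * plane[y0][x1][c])
        + fy * ((1 - fx) * plane[y1][x0][c] + fx * plane[y1][x1][c])
        for c in range(channels)
    ]


def query_triplane(planes: Sequence[Plane], point: Sequence[float]) -> List[float]:
    """Sum the features sampled from the XY, XZ and YZ planes (assumed order)."""
    x, y, z = point
    f_xy = bilinear_sample(planes[0], x, y)
    f_xz = bilinear_sample(planes[1], x, z)
    f_yz = bilinear_sample(planes[2], y, z)
    return [a + b + c for a, b, c in zip(f_xy, f_xz, f_yz)]


def constant_plane(value: float, size: int = 4, channels: int = 2) -> Plane:
    """Build a toy plane whose every feature equals `value`."""
    return [[[value] * channels for _ in range(size)] for _ in range(size)]


# Toy usage: three constant planes, queried at an arbitrary point in the cube.
planes = [constant_plane(1.0), constant_plane(2.0), constant_plane(3.0)]
feature = query_triplane(planes, (0.1, -0.4, 0.7))
```

In a full pipeline the sampled feature would typically be passed to a small decoder that predicts density and color for volume rendering; that step is omitted here because the summary above does not specify it.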

Paper Structure (21 sections, 9 equations, 14 figures, 3 tables)

Introduction
Related Work
Method
Preliminaries
UVRM Architecture
Pose-free Training
Weak-supervision with SDS loss
Model Training
Experiments
Experiment Setup
Ablation Study
Comparison
Results on Real Videos
Conclusion
Implementation Details
...and 6 more sections

Figures (14)

Figure 1: Unlike previous methods, which focus on either (a) per-scene pose-free training or (b) a 3D reconstruction model trained with known camera poses, our proposed UVRM (c) aims for fully pose-free training of a 3D reconstruction model from 2D observations.
Figure 2: UVRM architecture. UVRM is a transformer-based reconstruction model for pose-free monocular video inputs. It first encodes each input view into latent space with a VAE encoder [vae]. Next, a T5-based transformer encoder [2020t5] extracts a pose-invariant feature by implicitly aligning the image latent sequence. The extracted features are then used to modulate a style-based synthesizer [karras2019style] that outputs a tri-plane representation. Here, "A" denotes a learned affine transform and "B" denotes learned per-channel scaling factors applied to the noise input.
Figure 3: Illustration of our pose-free training framework. For self-supervision, $k$ pseudo-views are randomly synthesized at scattered poses along a given trajectory; for weak supervision, SDS regularization is applied at random poses. We iteratively add more pseudo-views throughout training.
Figure 4: Iterative augmentation pipeline. We alternate between (left, green) weakly supervised training of the UVRM model with score distillation sampling (SDS) on the current set of reference view frames, and (right, orange) generating a new set of novel pseudo-views for self-supervised training using the current UVRM and the pre-trained diffusion model. The generated pseudo-views are supervised with a pixel-wise rendering loss.
Figure 5: Capability of pose-free alignment. UVRM can perform pose-free alignment and 3D reconstruction from a set of monocular videos, which shows strong potential for scalability.
...and 9 more figures
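
The alternation described in the captions of Figures 3 and 4 can be sketched as a simple control loop. This is only an illustrative skeleton under stated assumptions, not the authors' training code: `train_step` and `synthesize_views` are hypothetical callables standing in for the SDS-plus-render-loss update and the diffusion-based pseudo-view generation, and the number of rounds is an assumed parameter that the captions do not specify.

```python
from typing import Callable, List, TypeVar

View = TypeVar("View")


def iterative_augmentation(
    reference_views: List[View],
    train_step: Callable[[List[View], List[View]], None],
    synthesize_views: Callable[[int], List[View]],
    rounds: int,
    k: int,
) -> List[View]:
    """Alternate training and pseudo-view synthesis; return all pseudo-views."""
    pseudo_views: List[View] = []
    for _ in range(rounds):
        # Phase 1 (hypothetical callable): weak supervision with SDS on the
        # reference views, plus a pixel-wise loss on pseudo-views so far.
        train_step(reference_views, pseudo_views)
        # Phase 2 (hypothetical callable): generate k new pseudo-views with the
        # current model and the pre-trained diffusion model.
        pseudo_views.extend(synthesize_views(k))
    return pseudo_views


# Toy stand-ins that only record how the view sets grow over the rounds.
log = []


def toy_train_step(refs: List[str], pseudo: List[str]) -> None:
    log.append((len(refs), len(pseudo)))


def toy_synthesize(k: int) -> List[str]:
    return [f"pseudo_{i}" for i in range(k)]


views = iterative_augmentation(
    ["frame_0", "frame_1"], toy_train_step, toy_synthesize, rounds=3, k=2
)
```

With the toy stand-ins, the log shows the pseudo-view set growing by `k` each round while the reference set stays fixed, which mirrors the progressive augmentation the captions describe.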
