JOG3R: Towards 3D-Consistent Video Generators
Chun-Hao Paul Huang, Niloy Mitra, Hyeonho Jeong, Jae Shin Yoon, Duygu Ceylan
TL;DR
JOG3R tackles the lack of 3D-consistency in diffusion-based video generation by jointly training a video diffusion backbone with a 3D point-map reconstruction head, effectively unifying text-to-video and 2D-to-3D tasks. The method stitches a DiT-based OpenSora backbone to a DUSt3R-style reconstruction head and optimizes a generation loss $L_{gen}$ together with a reconstruction loss $L_{rec}$, forming $L_{total}=L_{gen}+\lambda L_{rec}$ with $\lambda=1$. At inference, the model can perform text-to-video generation (T2V), video-to-camera estimation (V2C), or generation with camera estimation (T2V+C) by routing intermediate features and applying temporal regularizers to enforce smooth camera trajectories. Experiments on RealEstate10K show that JOG3R yields 3D-consistent videos with competitive photometric quality and competitive camera pose estimation, outperforming baselines on 3D-consistency metrics such as MEt3R. Overall, the work demonstrates that simultaneous optimization of generation and 3D reconstruction is synergistic, enabling 3D-aware video generation without sacrificing visual realism.
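As a rough illustration of the combined objective $L_{total}=L_{gen}+\lambda L_{rec}$, the sketch below adds two loss terms with a weight `lam`. It is only a toy under stated assumptions: the helper names `mse` and `joint_loss` are hypothetical, and plain mean squared error stands in for both the diffusion generation loss and the point-map reconstruction loss, whose exact forms are not given here.

```python
from typing import Sequence, Tuple


def mse(pred: Sequence[float], target: Sequence[float]) -> float:
    """Mean squared error between two equally long sequences."""
    if len(pred) != len(target) or len(pred) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def joint_loss(
    gen_pred: Sequence[float],
    gen_target: Sequence[float],
    rec_pred: Sequence[float],
    rec_target: Sequence[float],
    lam: float = 1.0,  # lambda = 1, as stated in the summary above
) -> Tuple[float, float, float]:
    """Return (L_total, L_gen, L_rec) with L_total = L_gen + lam * L_rec.

    Assumption: MSE is a placeholder for both terms; the real model would use
    a diffusion loss for L_gen and a point-map regression loss for L_rec.
    """
    l_gen = mse(gen_pred, gen_target)
    l_rec = mse(rec_pred, rec_target)
    return l_gen + lam * l_rec, l_gen, l_rec


if __name__ == "__main__":
    total, l_gen, l_rec = joint_loss([1.0, 2.0], [0.0, 2.0], [0.0], [2.0])
    print(total, l_gen, l_rec)
```

With `lam = 1` both terms contribute equally; changing `lam` would shift the balance between photometric generation quality and 3D reconstruction accuracy.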
Abstract
Emergent capabilities of image generators have led to many impactful zero- or few-shot applications. Inspired by this success, we investigate whether video generators similarly exhibit 3D-awareness. Using structure-from-motion as a 3D-aware task, we test whether intermediate features of a video generator (OpenSora in our case) can support camera pose estimation. Surprisingly, we initially find only a weak correlation between the two tasks. Deeper investigation reveals that although the video generator produces plausible video frames, the frames themselves are not truly 3D-consistent. To address this, we propose to jointly train for the two tasks, using photometric generation and 3D-aware errors. Specifically, we find that state-of-the-art video generation and camera pose estimation (i.e., DUSt3R [79]) networks share common structures, and we propose an architecture that unifies the two. The proposed unified model, named JOG3R, produces camera pose estimates of competitive quality while generating 3D-consistent videos. In summary, we propose the first unified video generator that is 3D-consistent, generates realistic video frames, and can potentially be repurposed for other 3D-aware tasks.
