
JOG3R: Towards 3D-Consistent Video Generators

Chun-Hao Paul Huang, Niloy Mitra, Hyeonho Jeong, Jae Shin Yoon, Duygu Ceylan

TL;DR

JOG3R tackles the lack of 3D-consistency in diffusion-based video generation by jointly training a video diffusion backbone with a 3D point-map reconstruction head, effectively unifying text-to-video and 2D-to-3D tasks. The method stitches a DiT-based OpenSora backbone to a DUSt3R-style reconstruction head and optimizes with a generation loss $L_{gen}$ and a reconstruction loss $L_{rec}$, forming $L_{total}=L_{gen}+\lambda L_{rec}$ with $\lambda=1$. At inference, the model can perform T2V, V2C, or T2V+C by routing intermediate features and applying temporal regularizers to enforce smooth camera trajectories. Experiments on RealEstate10K show that JOG3R yields 3D-consistent videos with competitive photometric quality and competitive camera pose estimation, outperforming baselines on 3D-consistency metrics such as MEt3R. Overall, the work demonstrates that simultaneous optimization of generation and 3D reconstruction is synergistic, enabling 3D-aware video generation without sacrificing visual realism.
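To make the combined objective $L_{total}=L_{gen}+\lambda L_{rec}$ concrete, here is a minimal sketch in Python with NumPy. The summary does not spell out the exact form of either term, so both are stand-ins chosen for illustration: `generation_loss` is a plain mean squared error of the kind commonly used for diffusion training, and `reconstruction_loss` is a mean per-point Euclidean distance between predicted and ground-truth point maps. The function names and array shapes are hypothetical, and this is not the authors' implementation.

```python
import numpy as np


def generation_loss(pred, target):
    """Stand-in for L_gen (assumption): mean squared error."""
    return float(np.mean((pred - target) ** 2))


def reconstruction_loss(pred_points, gt_points):
    """Stand-in for L_rec (assumption): mean Euclidean distance per 3D point.

    Both arrays have shape (..., 3), one XYZ point per pixel.
    """
    return float(np.mean(np.linalg.norm(pred_points - gt_points, axis=-1)))


def total_loss(l_gen, l_rec, lam=1.0):
    """L_total = L_gen + lam * L_rec; lam = 1 matches the value stated above."""
    return l_gen + lam * l_rec


# Toy inputs with hypothetical shapes, chosen so the values are easy to check.
pred_signal = np.zeros((2, 3))
true_signal = np.ones((2, 3))
pred_points = np.zeros((4, 4, 3))
gt_points = np.tile([3.0, 4.0, 0.0], (4, 4, 1))  # every point is 5 units away

l_gen = generation_loss(pred_signal, true_signal)    # 1.0
l_rec = reconstruction_loss(pred_points, gt_points)  # 5.0
print(total_loss(l_gen, l_rec))                      # 6.0
```

With $\lambda=1$ the two terms are weighted equally, so the scale of each stand-in loss directly determines how much it contributes to the total.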

Abstract

Emergent capabilities of image generators have led to many impactful zero- or few-shot applications. Inspired by this success, we investigate whether video generators similarly exhibit 3D-awareness. Using structure-from-motion as a 3D-aware task, we test if intermediate features of a video generator (OpenSora in our case) can support camera pose estimation. Surprisingly, at first, we only find a weak correlation between the two tasks. Deeper investigation reveals that although the video generator produces plausible video frames, the frames themselves are not truly 3D-consistent. Instead, we propose to jointly train for the two tasks, using photometric generation and 3D-aware errors. Specifically, we find that SoTA video generation and camera pose estimation (i.e., DUSt3R [79]) networks share common structures, and propose an architecture that unifies the two. The proposed unified model, named JOG3R, produces camera pose estimates with competitive quality while producing 3D-consistent videos. In summary, we propose the first unified video generator that is 3D-consistent, generates realistic video frames, and can potentially be repurposed for other 3D-aware tasks.

Paper Structure (15 sections, 2 equations, 4 figures, 3 tables)

Figures (4)

  • Figure 1: We propose a unified framework to investigate if the intermediate features from a video generation model can be repurposed for 3D point map estimation by routing them to the SoTA decoder of DUSt3R. We investigate the effect of freezing vs. training certain modules of the generator using different combinations of generation and reconstruction losses.
  • Figure 2: We base our analysis on three main tasks at inference time: text-to-video (T2V), video-to-camera estimation (V2C), and joint video generation and camera estimation (T2V+C).
  • Figure 3: Qualitative camera pose estimation (V2C) results. Red to purple indicates the progression from the first to the last frame. Note that on these test videos, JOG3R yields improved point maps leading to improved camera tracks compared to pretrained DUSt3R.
  • Figure 4: Qualitative T2V+C generation results. The results are coherent with the camera paths from T2V$\rightarrow$V2C. Please see the supplementary material for videos.