View-Consistent Diffusion Representations for 3D-Consistent Video Generation

Duolikun Danier, Ge Gao, Steven McDonagh, Changjian Li, Hakan Bilen, Oisin Mac Aodha

TL;DR

The paper tackles 3D inconsistencies in diffusion-based video generation. It first shows, across seven camera-controlled video diffusion models (VDMs), that the view consistency of internal diffusion representations correlates strongly with the 3D consistency of the generated videos. It then introduces ViCoDR, a model-agnostic training-time approach that enforces view-consistent representations via a ranking-based 3D correspondence loss using pseudo-3D labels derived from VGGT, with no inference-time overhead. ViCoDR substantially improves 3D consistency across camera-controlled image-to-video (I2V), text-to-video (T2V), and multi-view generation models, while maintaining competitive image-quality and controllability metrics. The work highlights the practical value of multi-view representation alignment for geometrically coherent video synthesis and outlines trade-offs and limitations, including added training cost and applicability to static scenes.
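To make the idea of a ranking-based correspondence loss concrete, here is a minimal sketch in plain Python. It is not the paper's formulation, which this summary does not spell out. It uses a generic margin-based hinge loss as a stand-in, and the names `ranking_correspondence_loss`, `matches`, and `margin` are illustrative assumptions. The pseudo-3D labels are assumed to arrive as index pairs that mark which locations in two views observe the same 3D point.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def ranking_correspondence_loss(feats_a, feats_b, matches, margin=0.2):
    """Mean hinge penalty over (true match, non-match) pairs.

    feats_a, feats_b: lists of feature vectors, one per spatial location.
    matches: list of (i, j) pairs meaning location i in view a and
             location j in view b see the same 3D point (assumed given,
             e.g. from pseudo-3D labels).
    margin: hypothetical hyperparameter, not taken from the paper.
    """
    total, count = 0.0, 0
    for i, j in matches:
        pos = cosine(feats_a[i], feats_b[j])
        for k in range(len(feats_b)):
            if k == j:
                continue
            neg = cosine(feats_a[i], feats_b[k])
            # Penalize any non-match that ranks within `margin` of the match.
            total += max(0.0, margin - pos + neg)
            count += 1
    return total / count if count else 0.0
```

In this sketch the loss is zero when every true match is more similar than every non-match by at least the margin, and it grows as non-matching locations overtake the true match in the similarity ranking.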

Abstract

Video generation models have made significant progress in generating realistic content, enabling applications in simulation, gaming, and filmmaking. However, current generated videos still contain visual artifacts arising from 3D inconsistencies, e.g., objects and structures deforming under changes in camera pose, which can undermine user experience and simulation fidelity. Motivated by recent findings on representation alignment for diffusion models, we hypothesize that improving the multi-view consistency of video diffusion representations will yield more 3D-consistent video generation. Through detailed analysis on multiple recent camera-controlled video diffusion models, we reveal strong correlations between 3D-consistent representations and videos. We also propose ViCoDR, a new approach for improving the 3D consistency of video models by learning multi-view consistent diffusion representations. We evaluate ViCoDR on camera-controlled image-to-video, text-to-video, and multi-view generation models, demonstrating significant improvements in the 3D consistency of the generated videos. Project page: https://danier97.github.io/ViCoDR.

Paper Structure

This paper contains 25 sections, 5 equations, 15 figures, 10 tables.

Figures (15)

  • Figure 1: Training video diffusion models (VDMs) with ViCoDR results in more 3D-consistent output videos. ViCoDR enforces view-consistent diffusion representations during training, enhancing the VDMs' 3D awareness. Here we show generated frames from a camera-controlled VDM [he2025cameractrl], where we see significant visual artifacts on the front wheel and frame of the bike. In contrast, ViCoDR's output is more 3D consistent, as shown by the MEt3R [asim2025met3r] reprojection error map.
  • Figure 2: Analysis of 3D consistency in camera-controlled VDMs. We observe a strong correlation between 3D consistency of video generation (measured by MEt3R) and view consistency of VDM representations (measured by geometric correspondence).
  • Figure 3: Overview of ViCoDR. During video diffusion training, ViCoDR additionally supervises internal diffusion representations $(h^a,h^b)$ extracted from frame pairs $(x^a,x^b)$ with a 3D correspondence loss $L_{3dc}$, so as to learn view-consistent representations (PCA feature maps are visualized).
  • Figure 4: Qualitative comparison with baseline methods on CameraCtrl. CameraCtrl trained with ViCoDR is able to generate more 3D-consistent video frames at new viewpoints compared to the baselines. Frame indices are shown at the bottom right of each image. The first column indicates the conditioning input, where at the bottom left we illustrate the camera poses of the input frame and the plotted novel view. The first two rows show examples from RE10K, and the last row from DL3DV. See examples in the supplementary video.
  • Figure 5: Qualitative results of applying ViCoDR to text-to-video generation. The text prompt is shown above the images. Arrows highlight 3D-consistent and 3D-inconsistent regions. Best viewed zoomed in. See more examples in the Appendix.
  • ...and 10 more figures
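
The analysis summarized in Figure 2 relates two quantities per model: how view-consistent its representations are (measured by geometric correspondence) and how 3D-consistent its videos are (measured by MEt3R). The sketch below shows one common way such a comparison could be set up, assuming nearest-neighbor feature matching as the correspondence measure and a Pearson coefficient as the correlation. Both choices, and the function names, are assumptions for illustration and not the paper's exact protocol.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def correspondence_accuracy(feats_a, feats_b, matches):
    """Fraction of labeled locations whose nearest neighbor is the true match.

    feats_a, feats_b: lists of feature vectors from two views.
    matches: list of (i, j) ground-truth or pseudo-label index pairs.
    """
    if not matches:
        return 0.0
    correct = 0
    for i, j in matches:
        sims = [cosine(feats_a[i], fb) for fb in feats_b]
        best = max(range(len(sims)), key=sims.__getitem__)
        correct += int(best == j)
    return correct / len(matches)


def pearson(xs, ys):
    """Pearson correlation between two equal-length, non-constant sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)
```

With one correspondence accuracy and one video-level consistency value per model, `pearson` would be called on the two lists. The sign of the result depends on whether the video metric is reported as an error or as a score.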