GD-VDM: Generated Depth for better Diffusion-based Video Generation

Ariel Lapid, Idan Achituve, Lior Bracha, Ethan Fetaya

TL;DR

GD-VDM introduces a two-phase diffusion framework for video generation: it first synthesizes depth videos to capture scene layout, then conditions a Vid2Vid diffusion model on that depth to produce realistic videos. To mitigate the domain shift between the depth seen in training and the generated depth seen at inference, the depth inputs are noised with the forward diffusion process during training and then denoised with a depth DM, which aligns the training and inference distributions. Evaluations on Cityscapes (and SUN3D) show improved diversity and scene complexity over baselines, and human studies support the perceptual gains, though frame-level FVD does not perfectly reflect quality. The approach demonstrates the feasibility of depth-conditioned diffusion for complex video generation and offers a pathway toward more realistic, diverse driving scenes.
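
The noising half of the domain-shift mitigation mentioned above can be illustrated with the standard DDPM forward process, x_t = sqrt(ᾱ_t) · x_0 + sqrt(1 − ᾱ_t) · ε with ε ~ N(0, I). The following is a minimal sketch under stated assumptions, not the authors' implementation: the linear beta schedule, the noise level `t=250`, the toy tensor size, and the helper names `linear_alpha_bar` and `forward_noise` are all hypothetical choices made for illustration. The subsequent denoising pass with the depth DM is not shown.

```python
import math
import random


def linear_alpha_bar(num_steps, beta_start=1e-4, beta_end=0.02):
    """Cumulative products alpha_bar_t for a linear beta schedule (assumed, DDPM-style)."""
    alpha_bars = []
    running = 1.0
    for i in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * i / max(num_steps - 1, 1)
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars


def forward_noise(x0, t, alpha_bars, rng):
    """Sample x_t from q(x_t | x_0) = N(sqrt(a_bar_t) * x_0, (1 - a_bar_t) * I)."""
    a_bar = alpha_bars[t]
    signal_scale = math.sqrt(a_bar)
    noise_scale = math.sqrt(1.0 - a_bar)
    return [signal_scale * v + noise_scale * rng.gauss(0.0, 1.0) for v in x0]


# Hypothetical toy "depth video": 2 frames of 4x4 depth values, flattened, in [-1, 1].
rng = random.Random(0)
depth_video = [rng.uniform(-1.0, 1.0) for _ in range(2 * 4 * 4)]
alpha_bars = linear_alpha_bar(num_steps=1000)

# Partially noise the training depth input. In the scheme described above, a
# depth diffusion model would then denoise this sample (not implemented here),
# so the Vid2Vid model trains on depth that resembles generated depth.
noised_depth = forward_noise(depth_video, t=250, alpha_bars=alpha_bars, rng=rng)
```

A larger `t` destroys more of the original depth signal, so the choice of noise level trades off how closely the denoised depth matches the original input against how strongly it resembles generated depth; the value used here is arbitrary.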

Abstract

The field of generative models has recently witnessed significant progress, with diffusion models showing remarkable performance in image generation. In light of this success, there is a growing interest in exploring the application of diffusion models to other modalities. One such challenge is the generation of coherent videos of complex scenes, which poses several technical difficulties, such as capturing temporal dependencies and generating long, high-resolution videos. This paper proposes GD-VDM, a novel diffusion model for video generation, demonstrating promising results. GD-VDM is based on a two-phase generation process involving generating depth videos followed by a novel diffusion Vid2Vid model that generates a coherent real-world video. We evaluated GD-VDM on the Cityscapes dataset and found that it generates more diverse and complex scenes compared to natural baselines, demonstrating the efficacy of our approach.

Paper Structure (23 sections, 13 figures, 6 tables)

Figures (13)

  • Figure 1: 5-frame sequences generated from 3 methods trained on the Cityscapes dataset and a Ground Truth sequence.
  • Figure 2: Overview of the Vid2Vid-DM architecture. During training, the source depth maps first pass through a forward noising process and then a reverse denoising process, which simulates inference artifacts. The resulting denoised depth sequences are fed into the Depth U-Net of the Vid2Vid-DM. The real-world video also passes through a forward noising process, and the Video U-Net of the Vid2Vid-DM takes in the noised video together with the output of the Depth U-Net, on which it is conditioned through concatenation skip connections.
  • Figure 3: Overview of the vid2vid architecture
  • Figure 4: Randomly selected frames from randomly generated videos; ours show more diverse and complex driving scenes. More examples are provided in the appendix.
  • Figure 5: Generated depth and the synthesized video produced by conditioning on it.
  • ...and 8 more figures