Hierarchical Video Generation for Complex Data

Lluis Castrejon, Nicolas Ballas, Aaron Courville

TL;DR

HVG introduces a hierarchical, coarse-to-fine approach to video generation that partitions the task into sequential levels, enabling high-resolution and long-duration video synthesis with reduced memory demands. Each level is trained as a GAN: the first level establishes a global, low-resolution outline, and subsequent upsampling levels refine both spatial and temporal details while remaining grounded in earlier outputs via a matching discriminator. The model is validated on Kinetics-600 and BDD100K, where it achieves competitive IS/FID/FVD scores and generates 256x256 videos with 48 frames; this scalability comes from training levels sequentially and on partial views. Ablations show the importance of grounding through the matching discriminator and the trade-offs between temporal context and computational cost, underscoring HVG's favorable scaling relative to prior methods such as DVD-GAN. Overall, HVG offers a practical path toward high-quality, long-horizon video generation on real-world data.

Abstract

Videos can often be created by first outlining a global description of the scene and then adding local details. Inspired by this, we propose a hierarchical model for video generation which follows a coarse-to-fine approach. First, our model generates a low-resolution video that establishes the global scene structure; this video is then refined by subsequent levels in the hierarchy. We train each level in our hierarchy sequentially on partial views of the videos. This reduces the computational complexity of our generative model, which scales to high-resolution videos beyond a few frames. We validate our approach on Kinetics-600 and BDD100K, for which we train a three-level model capable of generating 256x256 videos with 48 frames.
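
The coarse-to-fine structure described in the abstract can be illustrated with a small shape-tracking sketch. It is not the authors' implementation: the first-level shape (12 frames at 64x64) and the per-level upsampling factors (2x in time, 2x in space) are assumptions chosen only so that a three-level hierarchy ends at the 48-frame, 256x256 output the abstract reports. The names `VideoShape`, `upsample`, and `hierarchy_shapes` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VideoShape:
    """Shape of a video tensor: number of frames and spatial resolution."""
    frames: int
    height: int
    width: int


def upsample(shape, t_factor, s_factor):
    """Shape produced by one upsampling level (temporal and spatial refinement)."""
    return VideoShape(shape.frames * t_factor,
                      shape.height * s_factor,
                      shape.width * s_factor)


def hierarchy_shapes(first_level, factors):
    """Output shape of every level, starting from the low-resolution outline."""
    shapes = [first_level]
    for t_factor, s_factor in factors:
        shapes.append(upsample(shapes[-1], t_factor, s_factor))
    return shapes


# ASSUMPTION: the first-level shape and the (temporal, spatial) factors are
# illustrative; only the final 48-frame 256x256 output comes from the abstract.
shapes = hierarchy_shapes(VideoShape(12, 64, 64), [(2, 2), (2, 2)])
for level, s in enumerate(shapes, start=1):
    print(f"level {level}: {s.frames} frames at {s.height}x{s.width}")
```

Each level only has to solve a simpler problem than end-to-end generation: the first level fixes the global scene layout, and each later level adds detail to an output whose structure is already determined.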

Paper Structure

This paper contains 43 sections, 5 equations, 15 figures, 5 tables.

Figures (15)

  • Figure 1: Hierarchical Video Generation. We propose to divide the generative process into multiple simpler problems. HVG first generates a low-resolution video that depicts a full scene at a reduced framerate. This scene outline is then progressively upscaled and temporally interpolated. Levels are trained sequentially and do not backpropagate the gradient to the previous levels. Additionally, upscaling levels are trained on temporal crops of their inputs (illustrated by the non-shaded images) to reduce their computational requirements. Our model is competitive with the state of the art in video generation and enables the generation of longer high-resolution videos than possible with previous methods.
  • Figure 2: Upsampling level parametrization. The upsampling levels use a conditional generator and three discriminators: spatial/2D, temporal/3D, and matching. The conditional generator learns to upsample the previous level output, while the matching discriminator is trained on pairs of real/generated conditions and outputs.
  • Figure 3: Randomly selected HVG 48/128x128 frame samples for Kinetics-600. These samples were generated by unrolling HVG 12/128x128 to generate 48-frame sequences, 4 times its training horizon. Each row shows frames from the same sample at different timesteps. The generations are temporally consistent and the frame quality does not degrade over time.
  • Figure 4: DVD-GAN fails to generate samples beyond its training horizon. These samples were obtained by changing the spatial dimensions of the latent in a 6/128x128 DVD-GAN model to produce 48/128x128 videos. The samples quickly degrade after the first few frames and become motionless.
  • Figure 5: Scaling the computational costs. This plot shows the required GPU memory for a two-level HVG. We observe that the cost scales linearly with the output length for the first level, while the cost for the second level is fixed because it operates on a fixed-length partial view of its input. Our model scales better than a comparable non-hierarchical model.
  • ...and 10 more figures
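
The Figure 1 and Figure 5 captions describe training upsampling levels on temporal crops, so that the second-level cost stays fixed while the first-level cost grows with the output length. The toy sketch below illustrates that argument only. It counts tensor elements as a rough proxy for memory and is not a measurement; the crop length, resolutions, and the helper names `temporal_crop`, `level1_cost`, and `level2_cost` are hypothetical.

```python
import random


def temporal_crop(frames, crop_len, rng):
    """Return a random contiguous window of crop_len frames."""
    start = rng.randrange(len(frames) - crop_len + 1)
    return frames[start:start + crop_len]


def level1_cost(num_frames, size=32):
    """Elements in the full low-resolution video the first level generates."""
    return num_frames * size * size


def level2_cost(crop_len, t_factor=2, size=64):
    """Elements the second level outputs when trained on a fixed-length crop."""
    return crop_len * t_factor * size * size


# ASSUMPTION: a 6-frame crop and the sizes above are illustrative values,
# not settings taken from the paper.
CROP_LEN = 6
rng = random.Random(0)
costs = []
for num_frames in (12, 24, 48):
    level1_output = list(range(num_frames))  # frame indices stand in for frames
    crop = temporal_crop(level1_output, CROP_LEN, rng)
    costs.append((num_frames, level1_cost(num_frames), level2_cost(len(crop))))
    print(f"{num_frames} frames: level 1 = {costs[-1][1]}, level 2 = {costs[-1][2]}")
```

In this proxy the first-level count doubles whenever the number of frames doubles, while the second-level count is identical across all three lengths because it depends only on the crop length.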