Hierarchical Video Generation for Complex Data
Lluis Castrejon, Nicolas Ballas, Aaron Courville
TL;DR
HVG is a hierarchical, coarse-to-fine approach to video generation that splits the task into sequential levels, enabling high-resolution, long-duration video synthesis with reduced memory demands. Each level is trained as a GAN: the first level establishes a global, low-resolution outline, and subsequent upsampling levels refine spatial and temporal detail while staying grounded in the earlier outputs via a matching discriminator. The model is validated on Kinetics-600 and BDD100K, where it achieves competitive IS/FID/FVD scores and generates 256x256 videos with 48 frames; this scale is reached by training the levels sequentially and on partial views of the videos. Ablations show the importance of grounding through the matching discriminator and the trade-off between temporal context and computational cost, underscoring HVG's strong scaling properties relative to prior methods such as DVD-GAN. Overall, HVG offers a practical path toward high-quality, long-horizon video generation on real-world data.
Abstract
Videos can often be created by first outlining a global description of the scene and then adding local details. Inspired by this, we propose a hierarchical model for video generation that follows a coarse-to-fine approach. First, our model generates a low-resolution video that establishes the global scene structure; this video is then refined by subsequent levels in the hierarchy. We train each level in our hierarchy sequentially on partial views of the videos. This reduces the computational complexity of our generative model, which scales to high-resolution videos beyond a few frames. We validate our approach on Kinetics-600 and BDD100K, for which we train a three-level model capable of generating 256x256 videos with 48 frames.
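The coarse-to-fine idea and the memory benefit of partial views can be illustrated with a small shape-only sketch. This is not the authors' implementation: the random first level and the nearest-neighbour `toy_upsample` are stand-ins for learned GAN levels, and the 12-frame 64x64 starting size, the factor-of-2 upsampling, and the 6-frame window are assumptions chosen only so that the final output matches the 48-frame 256x256 size stated in the abstract.

```python
import numpy as np

def toy_first_level(frames, height, width, channels=3, seed=0):
    """Hypothetical stand-in for the first-level generator: a random coarse video."""
    rng = np.random.default_rng(seed)
    return rng.random((frames, height, width, channels), dtype=np.float32)

def toy_upsample(video, t_factor, s_factor):
    """Hypothetical stand-in for a learned upsampling level (nearest-neighbour repeat)."""
    video = np.repeat(video, t_factor, axis=0)   # more frames
    video = np.repeat(video, s_factor, axis=1)   # taller
    return np.repeat(video, s_factor, axis=2)    # wider

def temporal_window(video, start, length):
    """Partial view: a contiguous run of frames from the previous level's output."""
    return video[start:start + length]

# Assumed sizes and factors, for illustration only.
coarse = toy_first_level(frames=12, height=64, width=64)   # (12, 64, 64, 3)
mid = toy_upsample(coarse, t_factor=2, s_factor=2)         # (24, 128, 128, 3)

# Refining only a window of the mid-level video keeps the working set small.
window = temporal_window(mid, start=0, length=6)           # (6, 128, 128, 3)
fine_window = toy_upsample(window, t_factor=2, s_factor=2) # (12, 256, 256, 3)

# Refining the whole mid-level video yields the full-size output.
full_fine = toy_upsample(mid, t_factor=2, s_factor=2)      # (48, 256, 256, 3)

print(coarse.shape, mid.shape, fine_window.shape, full_fine.shape)
print("window / full elements:", fine_window.size / full_fine.size)
```

In this toy setting the windowed output holds a quarter of the elements of the full 48-frame video, which is the kind of saving that makes operating on partial views attractive at high resolution.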
