Hierarchical Patch Diffusion Models for High-Resolution Video Generation
Ivan Skorokhodov, Willi Menapace, Aliaksandr Siarohin, Sergey Tulyakov
TL;DR
The paper addresses the challenge of scaling diffusion models to high-resolution video by proposing Hierarchical Patch Diffusion Models (HPDM), which operate on a pyramid of patches rather than full-resolution inputs. It introduces deep context fusion, which conditions high-resolution patches on globally aligned features from lower pyramid levels, and adaptive computation, which allocates more capacity to coarse details, enabling end-to-end training directly in pixel space. HPDM reports state-of-the-art class-conditional results on UCF-101 $256^2$, with an FVD of $66.32$ and an Inception Score (IS) of $87.68$, and can be rapidly fine-tuned from a $36 \times 64$ low-resolution base to $64 \times 288 \times 512$ text-to-video synthesis. The authors state that, to their knowledge, it is the first diffusion model trained entirely end-to-end at such resolutions. Because it trains on patches rather than whole videos, the approach is efficient during training and scales to text-to-video tasks, with potential applicability to other patch-wise generative paradigms.
Abstract
Diffusion models have demonstrated remarkable performance in image and video synthesis. However, scaling them to high-resolution inputs is challenging and requires restructuring the diffusion pipeline into multiple independent components, limiting scalability and complicating downstream applications. In this work, we study patch diffusion models (PDMs), a diffusion paradigm that models the distribution of patches rather than whole inputs. This makes it very efficient during training and unlocks end-to-end optimization on high-resolution videos. We improve PDMs in two principled ways. First, to enforce consistency between patches, we develop deep context fusion -- an architectural technique that propagates the context information from low-scale to high-scale patches in a hierarchical manner. Second, to accelerate training and inference, we propose adaptive computation, which allocates more network capacity and computation towards coarse image details. The resulting model sets a new state-of-the-art FVD score of 66.32 and Inception Score of 87.68 in class-conditional video generation on UCF-101 $256^2$, surpassing recent methods by more than 100%. Then, we show that it can be rapidly fine-tuned from a base $36 \times 64$ low-resolution generator for high-resolution $64 \times 288 \times 512$ text-to-video synthesis. To the best of our knowledge, our model is the first diffusion-based architecture that is trained on such high resolutions entirely end-to-end. Project webpage: https://snap-research.github.io/hpdm.
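To make the "pyramid of patches" idea concrete, the following is a minimal illustrative sketch, not the authors' implementation. It assumes that every pyramid level yields a patch of the same fixed shape, that level 0 covers the whole video, and that each later level covers a random sub-region of the previous one at half its extent. The function name `sample_patch_pyramid`, the halving rule, and the nearest-neighbour resampling are all hypothetical choices made for illustration.

```python
import numpy as np

def sample_patch_pyramid(video, num_levels=3, patch_shape=(4, 18, 32), rng=None):
    """Sample one fixed-size patch per pyramid level, coarse to fine.

    ASSUMPTION (illustrative only): level 0 spans the full video and every
    later level spans a random sub-region of the previous region at half its
    extent. The video must be at least as large as patch_shape on each axis.
    """
    rng = np.random.default_rng() if rng is None else rng
    start = np.zeros(3, dtype=int)                # region offset (t, y, x)
    size = np.array(video.shape[:3], dtype=int)   # region extent (t, h, w)
    patches, regions = [], []
    for level in range(num_levels):
        if level > 0:
            # Halve the region, but never go below the patch resolution.
            new_size = np.maximum(size // 2, patch_shape)
            # Random placement that keeps the new region inside the old one.
            offset = rng.integers(0, size - new_size + 1)
            start, size = start + offset, new_size
        # Nearest-neighbour resampling of the region down to patch_shape.
        idx = [start[d] + (np.arange(patch_shape[d]) * size[d]) // patch_shape[d]
               for d in range(3)]
        patches.append(video[np.ix_(*idx)])
        regions.append((tuple(int(v) for v in start), tuple(int(v) for v in size)))
    return patches, regions

# Usage: a small dummy video with layout (frames, height, width, channels).
video = np.zeros((16, 72, 128, 3), dtype=np.float32)
patches, regions = sample_patch_pyramid(video, rng=np.random.default_rng(0))
# Share of the video's pixel locations held in the sampled patches.
kept = sum(p[..., 0].size for p in patches) / video[..., 0].size
```

Under these assumptions, the network only ever sees a few small, equally sized tensors per video, which is the intuition behind the training efficiency described in the abstract. Conditioning finer patches on features of coarser ones (deep context fusion) is not modeled in this sketch.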
