Hierarchical Patch Diffusion Models for High-Resolution Video Generation

Ivan Skorokhodov, Willi Menapace, Aliaksandr Siarohin, Sergey Tulyakov

TL;DR

The paper addresses the challenge of scaling diffusion models to high-resolution video by proposing Hierarchical Patch Diffusion Models (HPDM), which operate on a pyramid of patches rather than full-resolution inputs. It introduces deep context fusion, which conditions high-resolution patches on spatially aligned features from lower pyramid levels, and adaptive computation, which allocates more capacity to coarse details; together these enable end-to-end training directly in pixel space. HPDM achieves state-of-the-art results on UCF-101 $256^2$ with an FVD of $66.32$ and an IS of $87.68$, and can be rapidly fine-tuned from a low-resolution $36 \times 64$ base generator to high-resolution $64 \times 288 \times 512$ text-to-video synthesis. To the best of the authors' knowledge, it is the first diffusion model trained entirely end-to-end at such high resolutions. The approach offers substantial training efficiency gains, scales to text-to-video tasks, and may apply to other patch-wise generative paradigms.

Abstract

Diffusion models have demonstrated remarkable performance in image and video synthesis. However, scaling them to high-resolution inputs is challenging and requires restructuring the diffusion pipeline into multiple independent components, limiting scalability and complicating downstream applications. In this work, we study patch diffusion models (PDMs), a diffusion paradigm that models the distribution of patches rather than whole inputs. This makes it very efficient during training and unlocks end-to-end optimization on high-resolution videos. We improve PDMs in two principled ways. First, to enforce consistency between patches, we develop deep context fusion, an architectural technique that propagates context information from low-scale to high-scale patches in a hierarchical manner. Second, to accelerate training and inference, we propose adaptive computation, which allocates more network capacity and computation towards coarse image details. The resulting model sets a new state-of-the-art FVD score of 66.32 and Inception Score of 87.68 in class-conditional video generation on UCF-101 $256^2$, surpassing recent methods by more than 100%. Then, we show that it can be rapidly fine-tuned from a base $36\times 64$ low-resolution generator for high-resolution $64 \times 288 \times 512$ text-to-video synthesis. To the best of our knowledge, our model is the first diffusion-based architecture which is trained on such high resolutions entirely end-to-end. Project webpage: https://snap-research.github.io/hpdm.
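
To make the resolutions quoted in the abstract concrete, the sketch below counts how many patches would be needed to cover one $288 \times 512$ frame. It assumes, purely for illustration, that the patch size equals the base $36 \times 64$ resolution and that tiles do not overlap; neither assumption is stated in the abstract, and the helper names `tile_grid` and `tile_boxes` are hypothetical rather than taken from the paper or its code.

```python
def tile_grid(full_hw, patch_hw):
    """Return (rows, cols) of non-overlapping patches covering a frame."""
    full_h, full_w = full_hw
    patch_h, patch_w = patch_hw
    if full_h % patch_h or full_w % patch_w:
        raise ValueError("patch size must divide the frame size evenly")
    return full_h // patch_h, full_w // patch_w


def tile_boxes(full_hw, patch_hw):
    """List every tile as a (top, left, bottom, right) pixel box, row by row."""
    rows, cols = tile_grid(full_hw, patch_hw)
    patch_h, patch_w = patch_hw
    return [
        (r * patch_h, c * patch_w, (r + 1) * patch_h, (c + 1) * patch_w)
        for r in range(rows)
        for c in range(cols)
    ]


# Assumed setting: 288x512 frames tiled with 36x64 patches.
rows, cols = tile_grid((288, 512), (36, 64))
boxes = tile_boxes((288, 512), (36, 64))
# Fraction of one frame's pixels that a single patch contains.
pixel_fraction = 1.0 / (rows * cols)
print(rows, cols, len(boxes), pixel_fraction)
```

Under these assumptions a single patch holds 1/64 of a frame's pixels, which gives a rough sense of why training on patches is cheaper than training on full frames.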

Paper Structure (21 sections, 9 equations, 11 figures, 8 tables)

Figures (11)

  • Figure 1: Comparing existing diffusion paradigms: Latent Diffusion Model (LDM) (upper left), Cascaded Diffusion Model (CDM) (bottom left), and Patch Diffusion Model (this work) during training (upper right) and inference (bottom right). In our work, we develop hierarchical patch diffusion, which never operates on full-resolution inputs, but instead optimizes the lower stages of the hierarchy to produce spatially aligned context information for the later pyramid levels to enforce global consistency between patches.
  • Figure 2: Architecture overview of Hierarchical Patch Diffusion Model (HPDM) for a 3-level pyramid. The model is trained to denoise all the patches jointly. During training, we use only a single patch from each pyramid level and restrict information propagation to the coarse-to-fine direction. This allows one to synthesize the whole image (or video) at a given resolution patch-by-patch using tiled inference (see Figure 1).
  • Figure 3: Deep Context Fusion. At each pyramid level, we grid-sample the features of a lower-resolution patch and concatenate them to the activations tensor of the current level. In this way, the information propagates in a coarse-to-fine manner and provides richer context than the pixel-space concatenation of cascaded DMs (see the UCF ablations table).
  • Figure 4: Provided samples from PVDM (left) and random samples from HPDM-L (right) for the same classes on UCF $256^2$. More samples are provided in the additional results appendix.
  • Figure 5: HPDM-T2V can be efficiently fine-tuned from a standard low-resolution $36 \times 64$ diffusion model to high-resolution $64\times 288 \times 512$ text-to-video generation in just 15,000 training steps.
  • ...and 6 more figures
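
The Figure 3 caption describes deep context fusion as grid-sampling the features of a lower-resolution patch and concatenating them to the current level's activations. The following is a minimal, dependency-free sketch of that idea for a single fusion step, not the authors' implementation. Everything in it is an assumption made for illustration: the function names, the use of plain nested lists instead of tensors, patch boxes given as `(y0, x0, y1, x1)` in global $[0, 1]$ coordinates, bilinear interpolation, and a pixel-center coordinate convention in which $-1$ and $1$ denote the outer edges of the coarse feature map.

```python
def to_local(coord, lo, hi):
    """Map a global coordinate inside [lo, hi] to the local range [-1, 1]."""
    return 2.0 * (coord - lo) / (hi - lo) - 1.0


def sampling_grid(fine_box, coarse_box, size):
    """Locate each fine-patch cell center in the coarse patch's [-1, 1] frame."""
    fy0, fx0, fy1, fx1 = fine_box
    cy0, cx0, cy1, cx1 = coarse_box
    h, w = size
    grid = []
    for i in range(h):
        gy = fy0 + (i + 0.5) / h * (fy1 - fy0)  # global y of the cell center
        row = []
        for j in range(w):
            gx = fx0 + (j + 0.5) / w * (fx1 - fx0)  # global x of the cell center
            row.append((to_local(gy, cy0, cy1), to_local(gx, cx0, cx1)))
        grid.append(row)
    return grid


def bilinear(feat, y, x):
    """Bilinearly sample one H x W channel at local (y, x), clamping at borders."""
    H, W = len(feat), len(feat[0])
    py = min(max((y + 1.0) / 2.0 * H - 0.5, 0.0), H - 1)
    px = min(max((x + 1.0) / 2.0 * W - 0.5, 0.0), W - 1)
    y0, x0 = int(py), int(px)
    y1, x1 = min(y0 + 1, H - 1), min(x0 + 1, W - 1)
    wy, wx = py - y0, px - x0
    top = feat[y0][x0] * (1 - wx) + feat[y0][x1] * wx
    bottom = feat[y1][x0] * (1 - wx) + feat[y1][x1] * wx
    return top * (1 - wy) + bottom * wy


def fuse(fine_feats, coarse_feats, fine_box, coarse_box):
    """Append spatially aligned coarse channels to the fine-level channels."""
    h, w = len(fine_feats[0]), len(fine_feats[0][0])
    grid = sampling_grid(fine_box, coarse_box, (h, w))
    context = [
        [[bilinear(channel, y, x) for (y, x) in row] for row in grid]
        for channel in coarse_feats
    ]
    return fine_feats + context  # concatenation along the channel axis


# Toy example: one coarse channel whose value equals its column index.
coarse = [[[float(c) for c in range(4)] for _ in range(4)]]
fine = [[[0.0, 0.0], [0.0, 0.0]]]
# The fine patch is the top-right quadrant of the region the coarse patch covers.
fused = fuse(fine, coarse, (0.0, 0.5, 0.5, 1.0), (0.0, 0.0, 1.0, 1.0))
print(len(fused), fused[1])
```

In the toy example the fine patch covers the right half of the coarse patch, so the sampled context channel takes its values from the coarse map's two rightmost columns. A tensor-based version would typically rely on a framework's grid-sampling operator instead of the explicit loops used here.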