ChopGrad: Pixel-Wise Losses for Latent Video Diffusion via Truncated Backpropagation

Dmitriy Rivkin, Parker Ewen, Lili Gao, Julian Ost, Stefanie Walz, Rasika Kangutkar, Mario Bijelic, Felix Heide

Abstract

Recent video diffusion models achieve high-quality generation through recurrent frame processing where each frame generation depends on previous frames. However, this recurrent mechanism means that training such models in the pixel domain incurs prohibitive memory costs, as activations accumulate across the entire video sequence. This fundamental limitation also makes fine-tuning these models with pixel-wise losses computationally intractable for long or high-resolution videos. This paper introduces ChopGrad, a truncated backpropagation scheme for video decoding, limiting gradient computation to local frame windows while maintaining global consistency. We provide a theoretical analysis of this approximation and show that it enables efficient fine-tuning with frame-wise losses. ChopGrad reduces training memory from scaling linearly with the number of video frames (full backpropagation) to constant memory, and compares favorably to existing state-of-the-art video diffusion models across a suite of conditional video generation tasks with pixel-wise losses, including video super-resolution, video inpainting, video enhancement of neural-rendered scenes, and controlled driving video generation.
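
The memory argument in the abstract can be illustrated on a toy problem. The sketch below is not the authors' implementation: it assumes a scalar recurrent "decoder" `h_t = tanh(w * z_t + u * h_{t-1})` with a frame-wise squared-error loss, and all names (`step`, `full_grad`, `truncated_grad`, `d_trunc`) are hypothetical. Full backpropagation stores one activation per frame, whereas the truncated variant backpropagates each frame's loss through at most `d_trunc` earlier frames and therefore keeps only a fixed-size window.

```python
import math
from collections import deque


def step(w, u, z, h_prev):
    """One step of the toy recurrent decoder (an assumption, not the paper's model)."""
    return math.tanh(w * z + u * h_prev)


def full_grad(w, u, zs, targets):
    """Full backpropagation: every activation is stored, so memory grows with len(zs)."""
    hs = [0.0]
    for z in zs:
        hs.append(step(w, u, z, hs[-1]))
    grad_w, carry = 0.0, 0.0  # carry holds the gradient arriving from later frames
    for t in range(len(zs), 0, -1):
        dh = (hs[t] - targets[t - 1]) + carry
        dpre = dh * (1.0 - hs[t] ** 2)  # derivative through tanh
        grad_w += dpre * zs[t - 1]
        carry = dpre * u
    return grad_w, len(hs) - 1  # gradient and number of stored activations


def truncated_grad(w, u, zs, targets, d_trunc):
    """Each frame's loss is backpropagated through at most d_trunc previous frames."""
    window = deque(maxlen=d_trunc + 1)  # only the most recent activations are kept
    h_prev, grad_w, peak = 0.0, 0.0, 0
    for z, target in zip(zs, targets):
        h = step(w, u, z, h_prev)
        window.append((z, h))
        peak = max(peak, len(window))
        dh = h - target  # gradient of 0.5 * (h - target) ** 2 with respect to h
        for z_k, h_k in reversed(window):
            dpre = dh * (1.0 - h_k ** 2)
            grad_w += dpre * z_k
            dh = dpre * u  # pass the gradient one frame further back
        h_prev = h
    return grad_w, peak  # gradient and peak number of stored activations


zs = [0.5, -0.3, 0.8, 0.1, -0.6, 0.9, -0.2, 0.4]
targets = [0.0] * len(zs)
g_full, stored_full = full_grad(0.7, 0.4, zs, targets)
g_trunc, stored_trunc = truncated_grad(0.7, 0.4, zs, targets, d_trunc=2)
print(f"full:      grad_w={g_full:.6f}, stored activations={stored_full}")
print(f"truncated: grad_w={g_trunc:.6f}, stored activations={stored_trunc}")
```

In this toy setting the truncated gradient coincides with the full one once `d_trunc` reaches `len(zs) - 1`, and the stored window never exceeds `d_trunc + 1` entries regardless of sequence length. The per-frame inner loop is written for clarity, not efficiency.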

Paper Structure (46 sections, 6 equations, 20 figures, 6 tables, 1 algorithm)

Figures (20)

  • Figure 1: ChopGrad Method. ChopGrad unlocks pixel-wise losses for high-resolution, long-duration video diffusion models. It leverages truncated backpropagation to eliminate recursive activation accumulation in video autoencoders with causal caching. Solid arrows indicate the flow of information in the decoder forward pass; dashed arrows indicate the backward flow of gradients with ChopGrad. Adding ChopGrad to a training procedure is straightforward and yields state-of-the-art performance in a variety of applications that benefit from pixel-wise losses, such as video super-resolution, video inpainting, video enhancement of neural-rendered scenes, and controlled driving video generation.
  • Figure 2: ChopGrad Model Architecture. Given the processed video frame latents, the video decoder iteratively applies causal caching at each layer, producing pixel outputs. Caching is performed by taking a subset of the layer outputs and prepending them to the layer inputs for the next frame group. Although this substantially reduces memory use at inference time compared to full 3D convolution over all frame groups, during training it introduces recursive activation accumulation in the decoder, making backpropagation prohibitively expensive for high-resolution or long videos when using pixel-wise losses. With truncated backpropagation, gradients are allowed to accumulate only through a fixed number ($D_{trunc}$) of previous frame groups.
  • Figure 3: Temporal Locality. Samples of the influence measure (Eq. \ref{eq:influence}) as a function of temporal distance between decoder inputs (i.e., latent embeddings) and outputs (i.e., pixels), alongside the mean and line of best fit. As temporal distance increases, the influence between embeddings decreases exponentially, resulting in minimal gradient contributions (Eq. \ref{eq:final_grad}).
  • Figure 4: Impact of Truncation Distance on Backbone Model Parameter Gradients. Normalized MAE and cosine distance (computed by flattening all model parameters) are shown. Although the error is significant at small truncation distances, the cosine similarity remains high across all distances, implying that the errors are primarily in magnitude, not direction.
  • Figure 5: Truncation-Induced Gradient Error. Mean gradient error (Eq. \ref{eq:full_error}) between the truncated and full backpropagation algorithms as a function of truncation distance.
  • ...and 15 more figures
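
The qualitative behavior described in the captions for Figures 3 to 5 can be reproduced on a toy linear recurrence. This is a sketch under stated assumptions, not the paper's experiment: it assumes a scalar decoder `y_t = gain * z_t + decay * y_{t-1}` with loss `sum_t y_t`, so the influence of latent `z_s` on output `y_t` is `gain * decay ** (t - s)` and decays exponentially with temporal distance. The metric definitions (`normalized_mae`, `cosine_similarity`) and all names are hypothetical and are not taken from the paper's equations, which are not reproduced in this section.

```python
import math


def latent_gradients(gain, decay, num_frames, d_trunc):
    """Gradient of sum_t y_t with respect to each latent z_s, keeping only
    paths whose temporal distance t - s is at most d_trunc."""
    grads = []
    for s in range(num_frames):
        max_dist = min(d_trunc, num_frames - 1 - s)
        grads.append(sum(gain * decay ** k for k in range(max_dist + 1)))
    return grads


def normalized_mae(full, approx):
    """Mean absolute error, normalized by the mean magnitude of the full gradient."""
    return sum(abs(f - g) for f, g in zip(full, approx)) / sum(abs(f) for f in full)


def cosine_similarity(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    return dot / (math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v)))


GAIN, DECAY, NUM_FRAMES = 1.0, 0.5, 16  # illustrative values only
full = latent_gradients(GAIN, DECAY, NUM_FRAMES, d_trunc=NUM_FRAMES - 1)

report = []
for d in range(NUM_FRAMES):
    approx = latent_gradients(GAIN, DECAY, NUM_FRAMES, d)
    report.append((d, normalized_mae(full, approx), cosine_similarity(full, approx)))

for d, err, cos in report[:6]:
    print(f"d_trunc={d}  normalized MAE={err:.4f}  cosine similarity={cos:.4f}")
```

In this toy setting the error shrinks as the truncation distance grows while the direction of the gradient stays close to that of the full gradient, which mirrors the trend the captions describe. How closely a real video decoder follows this idealized decay is an empirical question that the paper's figures address.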