Neural Residual Diffusion Models for Deep Scalable Vision Generation

Zhiyuan Ma, Liangliang Zhao, Biqing Qi, Bowen Zhou

TL;DR

The paper tackles scalability bottlenecks in diffusion-based vision generation by introducing Neural-RDM, a gating-residual diffusion framework that unifies flow-shaped and U-shaped backbones within a continuous-time dynamic setting. It replaces manual scheduling with learnable gating parameters and adds a Residual-Sensitivity ODE to maintain stability as depth grows, thereby enabling essentially infinite-depth diffusion modeling. The approach is grounded in PF-ODE theory and demonstrates state-of-the-art results on image and video generation, with strong fidelity, temporal coherence, and robustness to deep stacking. This dynamic, residual-based perspective provides a practical pathway toward massively scalable vision diffusion models and informs future architecture design beyond traditional U-Net and Transformer backbones.

Abstract

The most advanced diffusion models have recently adopted increasingly deep stacked networks (e.g., U-Net or Transformer) to promote emergent generative capabilities in vision generation models, similar to those of large language models (LLMs). However, progressively deeper stacked networks will intuitively introduce numerical propagation errors and weaken noise prediction on generative data, which hinders massively deep, scalable training of vision generation models. In this paper, we first show that neural networks can perform generative denoising effectively because their intrinsic residual unit has dynamics consistent with the reverse diffusion process of the input signal. Building on two common types of deep stacked networks, we then propose a unified and massively scalable Neural Residual Diffusion Models framework (Neural-RDM for short), a simple yet meaningful change to the common architecture of deep generative networks that introduces a series of learnable gated residual parameters conforming to the generative dynamics. Experimental results on various generative tasks show that the proposed neural residual models obtain state-of-the-art scores on image and video generation benchmarks. Rigorous theoretical proofs and extensive experiments also demonstrate the advantages of this simple gated residual mechanism, which is consistent with dynamic modeling, in improving the fidelity and consistency of generated content and in supporting large-scale scalable training. Code is available at https://github.com/Anonymous/Neural-RDM.
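
To make the idea of "learnable gated residual parameters" concrete, the following is a minimal sketch rather than the authors' implementation. It assumes each stacked unit has the form z_next = z + alpha * F(z) + beta, where alpha and beta are per-unit gates (the symbols match those named in the Figure 5 caption, but the exact update rule is an assumption). The functions `residual_branch`, `gated_residual_unit`, and `stack` are hypothetical names, and a fixed tanh stands in for a real denoising block.

```python
import math


def residual_branch(z):
    """Hypothetical stand-in for a denoising block F(z): elementwise tanh."""
    return [math.tanh(v) for v in z]


def gated_residual_unit(z, alpha, beta):
    """One gated residual unit, assumed form: z + alpha * F(z) + beta."""
    f = residual_branch(z)
    return [zi + alpha * fi + beta for zi, fi in zip(z, f)]


def stack(z, gates):
    """Apply a sequence of units, one (alpha, beta) gate pair per unit."""
    for alpha, beta in gates:
        z = gated_residual_unit(z, alpha, beta)
    return z


z0 = [0.5, -1.0, 2.0]

# With every gate at zero, each unit is the identity map, so the signal
# passes through unchanged no matter how many units are stacked.
identity_out = stack(z0, [(0.0, 0.0)] * 1000)

# Small gates change the signal gradually: since |tanh| <= 1, 100 units with
# alpha = 0.01 can move each coordinate by at most 1.0 in total.
deep_out = stack(z0, [(0.01, 0.0)] * 100)
```

In this toy setting the gates bound how much each unit can alter its input, which illustrates why gating can keep very deep stacks numerically well behaved. In an actual model, alpha and beta would be trained jointly with the network weights.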

Paper Structure (28 sections, 29 equations, 8 figures, 2 tables)

Figures (8)

  • Figure 1: Neural Residual-style Diffusion Models framework with massively scalable gating-based minimum residual stacking unit (mrs-unit).
  • Figure 2: Overview. (a) Flow-shaped residual stacking networks. (b) U-shaped residual stacking networks. (c) Our proposed unified and massively scalable residual stacking architecture (i.e., Neural-RDM) with learnable gating-residual mechanism. (d) Residual denoising process via Neural-RDM.
  • Figure 3: Compared with the latest baseline (SDXL-1.0 [podell2023sdxl]), the samples produced by Neural-RDM (trained on JourneyDB [sun2024journeydb]) exhibit exceptional quality, particularly in terms of fidelity and consistency in the details of the subjects in adhering to the provided textual prompts.
  • Figure 4: Compared with the latest baseline (Latte-XL [ma2024latte]), the sample videos from SkyTimelapse [xiong2018learning], Taichi-HD [siarohin2019first] and UCF101 [soomro2012ucf101] all exhibit better frame quality, temporal consistency and coherence.
  • Figure 5: The sensitivity of $\alpha$ and $\beta$ at different depths of the residual denoising network during the training process.
  • ...and 3 more figures