DiffusionBlocks: Block-wise Neural Network Training via Diffusion Interpretation

Makoto Shing, Masanori Koyama, Takuya Akiba

TL;DR

DiffusionBlocks addresses the memory bottleneck of end-to-end backpropagation by transforming transformer-style networks into independent diffusion blocks trained with a denoising score-matching objective. By interpreting residual updates as Euler steps of a reverse diffusion process, blocks can be trained independently with gradients for only one block at a time, enabling memory savings proportional to the number of blocks. The equi-probability partitioning of the noise distribution ensures balanced utilization of parameters across blocks and improves learning efficiency across tasks. Empirically, DiffusionBlocks matches end-to-end performance across vision, diffusion, autoregressive, and recurrent-depth tasks while delivering substantial memory and compute savings, illustrating its broad applicability to modern generative AI.

Abstract

End-to-end backpropagation requires storing activations throughout all layers, creating memory bottlenecks that limit model scalability. Existing block-wise training methods offer means to alleviate this problem, but they rely on ad-hoc local objectives and remain largely unexplored beyond classification tasks. We propose $\textit{DiffusionBlocks}$, a principled framework for transforming transformer-based networks into genuinely independent trainable blocks that maintain competitive performance with end-to-end training. Our key insight leverages the fact that residual connections naturally correspond to updates in a dynamical system. With minimal modifications to this system, we can convert the updates to those of a denoising process, where each block can be learned independently by leveraging the score matching objective. This independence enables training with gradients for only one block at a time, thereby reducing memory requirements in proportion to the number of blocks. Our experiments on a range of transformer architectures (vision, diffusion, autoregressive, recurrent-depth, and masked diffusion) demonstrate that DiffusionBlocks training matches the performance of end-to-end training while enabling scalable block-wise training on practical tasks beyond small-scale classification. DiffusionBlocks provides a theoretically grounded approach that successfully scales to modern generative tasks across diverse architectures.
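
The correspondence between residual updates and denoising steps can be made concrete with a short worked derivation. This is a sketch under stated assumptions, not the paper's exact formulation: it assumes an EDM-style probability-flow ODE parameterized directly by the noise level $\sigma$, and the denoiser symbol $D_{\boldsymbol{\theta}_b}$ is introduced here for illustration. The noisy target $\mathbf{z}_\sigma = \mathbf{y} + \sigma\boldsymbol{\epsilon}$, the block boundaries $\sigma_b$, and the weighting $w(\sigma)$ follow the notation of the figure captions.

```latex
% Residual update of layer l (standard network):
\mathbf{z}_{l+1} = \mathbf{z}_{l} + f_{\boldsymbol{\theta}_l}(\mathbf{z}_{l})

% Assumed probability-flow ODE in the noise level \sigma (EDM-style):
\frac{\mathrm{d}\mathbf{z}_\sigma}{\mathrm{d}\sigma}
  = \frac{\mathbf{z}_\sigma - D_{\boldsymbol{\theta}}(\mathbf{z}_\sigma, \sigma)}{\sigma}

% One Euler step from \sigma_{b-1} down to \sigma_b < \sigma_{b-1}:
\mathbf{z}_{\sigma_b}
  = \mathbf{z}_{\sigma_{b-1}}
  + \frac{\sigma_b - \sigma_{b-1}}{\sigma_{b-1}}
    \bigl(\mathbf{z}_{\sigma_{b-1}}
      - D_{\boldsymbol{\theta}_b}(\mathbf{z}_{\sigma_{b-1}}, \sigma_{b-1})\bigr)

% Per-block denoising objective, with \sigma restricted to block b's range:
\mathcal{L}_b(\boldsymbol{\theta}_b)
  = \mathbb{E}_{\mathbf{y},\, \boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \mathbf{I}),\,
      \sigma \in [\sigma_b, \sigma_{b-1}]}
    \Bigl[ w(\sigma)\,
      \bigl\| D_{\boldsymbol{\theta}_b}(\mathbf{y} + \sigma\boldsymbol{\epsilon}, \sigma)
        - \mathbf{y} \bigr\|_2^2 \Bigr]
```

The Euler step has the same "input plus correction" shape as the residual update. Because the loss $\mathcal{L}_b$ depends only on $\boldsymbol{\theta}_b$, no gradient needs to flow between blocks.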

Paper Structure

This paper contains 35 sections, 12 equations, 6 figures, 9 tables.

Figures (6)

  • Figure 1: Overview of DiffusionBlocks. Left: Standard networks require backpropagation through all layers. Center: DiffusionBlocks partitions networks into blocks, each trained independently to denoise within assigned noise ranges. Right: Applications. For diffusion models (top), inference requires only the relevant block per denoising step. For recurrent-depth models (bottom), our framework replaces iterative training with single-pass training, eliminating the computational overhead of backpropagation through time.
  • Figure 2: 3-step conversion of a standard neural network to DiffusionBlocks at training phase. Step 1: Partition $L$ layers into $B$ blocks. Step 2: Define noise distribution $p_\sigma$ (e.g., log-normal) and partition the range $[\sigma_{\min}, \sigma_{\max}]$ into $B$ intervals $\{[\sigma_{b}, \sigma_{b-1}]\}_{b=1}^B$, assigning each block a specific noise range (Section \ref{sec:partitioning}). Step 3: Augment blocks with noise conditioning: extend input to $\tilde{\mathbf{x}} = (\mathbf{x}, \mathbf{z}_\sigma)$ where $\mathbf{z}_\sigma = \mathbf{y} + \sigma\boldsymbol{\epsilon}$, and incorporate noise-level conditioning (e.g., via AdaLN). Then, each block is trained independently from other blocks to denoise within its assigned noise range.
  • Figure 3: Training and inference algorithms for standard residual networks (left) versus DiffusionBlocks (right). Given: An $L$-layer network partitioned into $B$ blocks with noise ranges $\{[\sigma_{b}, \sigma_{b-1}]\}_{b=1}^B$, noise distribution $p_\sigma$, and training data $\{(\mathbf{x}_n, \mathbf{y}_n)\}_{n=1}^N$. The function $w(\sigma)$ denotes the loss weighting, and $\bar{f}_{\boldsymbol{\theta}_b \mid \cdot}$ represents the noise-conditioned block with parameters $\boldsymbol{\theta}_b$.
  • Figure 4: Equi-probability partitioning ($B=3$). Blocks partition the log-normal $p_\sigma$ by equal probability mass (orange boundaries), not uniform spacing (gray), concentrating capacity where denoising is most challenging.
  • Figure 5: Converting different architectures to DiffusionBlocks: Training. During training, noise is added to target outputs (labels, embeddings, or images) and each block learns to denoise within its assigned noise range. Blocks are sampled randomly and trained independently, requiring gradients for only one block at a time.
  • ...and 1 more figure
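
The equi-probability partitioning in the Figure 4 caption, together with the per-block noise sampling used during training, can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, and the log-normal parameters (`p_mean`, `p_std`) and the range `[0.002, 80.0]` are assumed values chosen for the example. Boundaries are returned in decreasing order so that block $b$ covers $[\sigma_b, \sigma_{b-1}]$, matching the caption notation.

```python
import math
import random
from statistics import NormalDist


def equiprobability_boundaries(num_blocks, sigma_min, sigma_max,
                               p_mean=-1.2, p_std=1.2):
    """Return [sigma_0, ..., sigma_B] in decreasing order.

    log(sigma) is assumed to follow N(p_mean, p_std^2), truncated to
    [sigma_min, sigma_max]. Each interval [sigma_b, sigma_{b-1}] then
    carries the same probability mass, 1 / num_blocks.
    """
    log_dist = NormalDist(mu=p_mean, sigma=p_std)
    cdf_lo = log_dist.cdf(math.log(sigma_min))
    cdf_hi = log_dist.cdf(math.log(sigma_max))
    # Interior boundaries sit at equally spaced quantiles of the truncated law.
    interior = [
        math.exp(log_dist.inv_cdf(cdf_hi - (cdf_hi - cdf_lo) * b / num_blocks))
        for b in range(1, num_blocks)
    ]
    return [sigma_max, *interior, sigma_min]


def sample_sigma_in_block(boundaries, block, rng, p_mean=-1.2, p_std=1.2):
    """Draw sigma from the log-normal restricted to block `block` (1-indexed)."""
    log_dist = NormalDist(mu=p_mean, sigma=p_std)
    upper, lower = boundaries[block - 1], boundaries[block]
    u = rng.uniform(log_dist.cdf(math.log(lower)), log_dist.cdf(math.log(upper)))
    sigma = math.exp(log_dist.inv_cdf(u))
    # Clamp to guard against floating-point drift at the interval edges.
    return min(max(sigma, lower), upper)


# Example: B = 3 blocks over an assumed noise range.
boundaries = equiprobability_boundaries(3, sigma_min=0.002, sigma_max=80.0)
rng = random.Random(0)
block = rng.randint(1, 3)          # blocks are sampled at random during training
sigma = sample_sigma_in_block(boundaries, block, rng)
print(boundaries, block, sigma)
```

Because the boundaries follow quantiles rather than uniform spacing, the intervals are narrow where $p_\sigma$ has most of its mass and wide in the tails.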