B-DENSE: Branching For Dense Ensemble Network Learning

Cherish Puniani, Tushar Kumar, Arnav Bendre, Gaurav Kumar, Shree Singhi

TL;DR

Diffusion models deliver high-quality samples but incur slow inference due to many denoising steps. The authors introduce B-DENSE, a dense trajectory supervision framework that expands the student with multiple branches to mirror intermediate teacher states and trains them with a branch-wise loss. They provide a PF-ODE–inspired interpretation and demonstrate improvements in FID, especially at ultra-low NFEs, with negligible overhead, validating compatibility with existing distillation pipelines on CIFAR-10 and ImageNet. This approach offers a practical path to faster yet high-quality diffusion sampling and broad applicability to diffusion-distillation methods.

Abstract

Inspired by non-equilibrium thermodynamics, diffusion models have achieved state-of-the-art performance in generative modeling. However, their iterative sampling nature results in high inference latency. While recent distillation techniques accelerate sampling, they discard intermediate trajectory steps. This sparse supervision leads to a loss of structural information and introduces significant discretization errors. To mitigate this, we propose B-DENSE, a novel framework that leverages multi-branch trajectory alignment. We modify the student architecture to output $K$-fold expanded channels, where each subset corresponds to a specific branch representing a discrete intermediate step in the teacher's trajectory. By training these branches to simultaneously map to the entire sequence of the teacher's target timesteps, we enforce dense intermediate trajectory alignment. Consequently, the student model learns to navigate the solution space from the earliest stages of training, demonstrating superior image generation quality compared to baseline distillation frameworks.
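The channel-expansion idea in the abstract can be illustrated with a small NumPy sketch. It is not the authors' implementation: the function names `split_branches` and `branch_wise_loss`, the mean-squared-error reconstruction term, the equal default branch weights, and the convention that branch $k$ is matched to the $k$-th teacher state are all assumptions made for illustration. The sketch only shows how a $K$-fold expanded channel dimension could be split into $K$ branches and compared against $K$ teacher states.

```python
import numpy as np

def split_branches(student_out, K):
    """Split a (B, K*C, H, W) array into K branches, returned as (K, B, C, H, W).

    Assumption: branch k occupies the contiguous channel block [k*C, (k+1)*C).
    """
    B, KC, H, W = student_out.shape
    if KC % K != 0:
        raise ValueError("channel dimension must be divisible by K")
    branches = student_out.reshape(B, K, KC // K, H, W)
    return np.moveaxis(branches, 1, 0)

def branch_wise_loss(student_out, teacher_states, weights=None):
    """Weighted sum of per-branch MSE terms (hypothetical loss form).

    teacher_states has shape (K, B, C, H, W): one teacher state per branch,
    e.g. x_{t-1}, x_{t-2}, x_{t-3} when K = 3.
    """
    K = teacher_states.shape[0]
    branches = split_branches(student_out, K)
    # Mean squared error of each branch against its own teacher state.
    per_branch = ((branches - teacher_states) ** 2).mean(axis=(1, 2, 3, 4))
    if weights is None:
        weights = np.full(K, 1.0 / K)  # assumption: equal weighting
    return float(np.dot(weights, per_branch)), per_branch

# Toy usage: K = 3 branches, batch of 2, 3-channel 4x4 "images".
rng = np.random.default_rng(0)
teacher = rng.normal(size=(3, 2, 3, 4, 4))
student = np.concatenate(list(teacher), axis=1)  # shape (2, 9, 4, 4)
total, per_branch = branch_wise_loss(student, teacher)
```

In this toy usage the student output is built directly from the teacher states, so every branch term is zero. In a real distillation loop the student would be a network evaluated at $x_t$, and the teacher states would come from running the teacher's sampler over the intermediate steps.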

Paper Structure

This paper contains 15 sections, 12 equations, 4 figures, 4 tables, 4 algorithms.

Figures (4)

  • Figure 1: A single iteration of the proposed B-DENSE method. Here, $x_t$ is a noisy image at a random time step $t$; $x_{t-3}$ is the output that the standard distillation method maps to directly, and $x_{t-1}$ and $x_{t-2}$ are the consecutive intermediate denoised outputs that the standard distillation method skips. The blue, yellow, and red channels indicate the branched channels for $K=3$. The blue, yellow, and red arrows show how each three-channel group is mapped to the final output and to an intermediate step, which is used to calculate the reconstruction loss for the corresponding branch.
  • Figure 2: FID results computed over 50k images
  • Figure 3: Comparison of results from SFD (left) and B-DENSE (right)
  • Figure 4: Fitted $(a, b)$ parameter pairs sorted by ascending FID score. The color scale indicates the FID score (lower is better). The clustering of high-performing configurations (dark purple) demonstrates a strong preference for a specific relationship between the growth rate $a$ and intercept $b$.