Denoising Task Routing for Diffusion Models

Byeongjun Park, Sangmin Woo, Hyojun Go, Jin-Young Kim, Changick Kim

TL;DR

The paper addresses negative transfer in diffusion-model training by treating each denoising timestep as a separate task. It introduces Denoising Task Routing (DTR), an add-on that establishes per-timestep task pathways through channel masking. DTR exploits Task Affinity between adjacent timesteps via a sliding-window masking strategy and, through Task Weights, assigns more task-specific channels to the early stages of denoising (higher timesteps), all without adding parameters. Empirically, DTR consistently improves FID, IS, and precision-recall across unconditional, class-conditional, and text-to-image generation, accelerates convergence, and enables smaller architectures to match larger ones with far fewer training iterations. The results indicate that explicit architectural MTL design for diffusion models, combined with compatible loss-weighting strategies, can yield substantial gains in both quality and training efficiency for generative image synthesis.

Abstract

Diffusion models generate highly realistic images by learning a multi-step denoising process, naturally embodying the principles of multi-task learning (MTL). Despite the inherent connection between diffusion models and MTL, there remains an unexplored area in designing neural architectures that explicitly incorporate MTL into the framework of diffusion models. In this paper, we present Denoising Task Routing (DTR), a simple add-on strategy for existing diffusion model architectures to establish distinct information pathways for individual tasks within a single architecture by selectively activating subsets of channels in the model. What makes DTR particularly compelling is its seamless integration of prior knowledge of denoising tasks into the framework: (1) Task Affinity: DTR activates similar channels for tasks at adjacent timesteps and shifts activated channels as sliding windows through timesteps, capitalizing on the inherent strong affinity between tasks at adjacent timesteps. (2) Task Weights: During the early stages (higher timesteps) of the denoising process, DTR assigns a greater number of task-specific channels, leveraging the insight that diffusion models prioritize reconstructing global structure and perceptually rich contents in earlier stages, and focus on simple noise removal in later stages. Our experiments reveal that DTR not only consistently boosts diffusion models' performance across different evaluation protocols without adding extra parameters but also accelerates training convergence. Finally, we show the complementarity between our architectural approach and existing MTL optimization techniques, providing a more complete view of MTL in the context of diffusion training. Significantly, by leveraging this complementarity, we attain matched performance of DiT-XL using the smaller DiT-L with a reduction in training iterations from 7M to 2M.
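
The sliding-window channel masking described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the 1-indexed timestep convention, and the exact way alpha and beta enter the formula are assumptions made for illustration. Here beta is taken to be the fraction of channels kept active, and alpha to control how quickly the window shifts as the timestep grows.

```python
import math


def routing_mask(t, num_timesteps, num_channels, alpha=2.0, beta=0.8):
    """Return a 0/1 channel mask for timestep t (1-indexed).

    Hypothetical sketch: a fixed-width window of active channels slides
    across the channel axis as t increases. Names and formula are assumed.
    """
    if not 1 <= t <= num_timesteps:
        raise ValueError("t must lie in [1, num_timesteps]")
    # Window width: the number of channels kept active (assumed beta * C).
    active = math.floor(beta * num_channels)
    # Normalized position of t in [0, 1].
    progress = (t - 1) / (num_timesteps - 1) if num_timesteps > 1 else 0.0
    # With alpha > 1 the window moves slowly at low t and quickly at high t,
    # so neighboring high timesteps share fewer channels in this sketch.
    start = math.floor((num_channels - active) * progress ** alpha)
    return [1 if start <= c < start + active else 0 for c in range(num_channels)]


def overlap(mask_a, mask_b):
    """Count the channels that are active in both masks."""
    return sum(a & b for a, b in zip(mask_a, mask_b))


if __name__ == "__main__":
    T, C = 11, 20
    first = routing_mask(1, T, C, alpha=1.0, beta=0.5)
    second = routing_mask(2, T, C, alpha=1.0, beta=0.5)
    last = routing_mask(T, T, C, alpha=1.0, beta=0.5)
    # Adjacent timesteps share more channels than distant ones.
    print(overlap(first, second), overlap(first, last))
```

In a network, such a mask would be multiplied channel-wise with a block's activations, which is why the approach adds no parameters. Where exactly the mask is applied inside each block is not specified here and is left out of the sketch.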

Paper Structure

This paper contains 34 sections, 9 equations, 16 figures, 8 tables, 3 algorithms.

Figures (16)

  • Figure 1: The overview of DTR. DTR makes explicit task-specific pathways by channel masking.
  • Figure 2: Routing masks in random routing and DTR with varying $\alpha$ ($\beta$ is fixed to 0.8). The activated and deactivated channels are colored in yellow and purple, respectively.
  • Figure 3: Compatibility of DTR and MTL loss weighting methods w.r.t. guidance scale. DTR robustly boosts the performance across various guidance scales for all metrics.
  • Figure 4: $\alpha, \beta$ ablation. We use DiT-B/2 on FFHQ 256$\times$256.
  • Figure 5: Convergence comparison on ImageNet. DTR yields faster FID-10K improvement.
  • ...and 11 more figures