Diffscaler: Enhancing the Generative Prowess of Diffusion Transformers

Nithin Gopalakrishnan Nair, Jeya Maria Jose Valanarasu, Vishal M. Patel

TL;DR

DiffScaler is a parameter-efficient method for scaling a single diffusion transformer across multiple datasets and tasks. It inserts Affiner blocks that reparameterize the existing weights and add new subspaces while the base weights stay frozen. This enables incremental dataset addition, supports both conditional and unconditional generation without extra encoders or zero convolutions, and achieves competitive results with a fraction of the trainable parameters used by baselines such as ControlNet. The approach combines per-task subspace learning with low-rank extensions to adapt both transformer and CNN backbones, and it demonstrates strong performance on diverse datasets and tasks. The work offers a practical path toward scalable transfer learning for diffusion models, with broad applicability to multi-dataset image generation.

Abstract

Recently, diffusion transformers have gained wide attention for their excellent performance in text-to-image and text-to-video models, emphasizing the need for transformers as the backbone of diffusion models. Transformer-based models have shown better generalization capability than CNN-based models on general vision tasks. However, the existing literature has explored much less the capabilities of transformer-based diffusion backbones and how to expand their generative prowess to other datasets. This paper focuses on enabling a single pre-trained diffusion transformer model to scale across multiple datasets swiftly, allowing diverse generative tasks to be completed with just one model. To this end, we propose DiffScaler, an efficient scaling strategy for diffusion models in which we train a minimal number of parameters to adapt to different tasks. In particular, we learn task-specific transformations at each layer by incorporating the ability to utilize the learned subspaces of the pre-trained model, as well as the ability to learn additional task-specific subspaces that may be absent from the pre-training dataset. As these parameters are independent, a single diffusion model with these task-specific parameters can be used to perform multiple tasks simultaneously. Moreover, we find that transformer-based diffusion models significantly outperform CNN-based diffusion models when fine-tuned on smaller datasets. We perform experiments on four unconditional image generation datasets. We show that, using our proposed method, a single pre-trained model can scale up to perform both conditional and unconditional tasks with minimal parameter tuning while performing close to fine-tuning an entire diffusion model for that particular task.
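
The per-layer, task-specific transformation described above can be illustrated with a minimal NumPy sketch. This is an assumption-laden reading of the abstract, not the authors' implementation: the class name `AffinerLinear`, the per-output-channel form of the scale, the zero initialization of the low-rank branch, the rank values, and the task names "faces" and "flowers" are all hypothetical choices made for illustration.

```python
import numpy as np


class AffinerLinear:
    """Frozen linear layer plus small per-task parameters (illustrative sketch)."""

    def __init__(self, weight, bias, rank=4, seed=0):
        self.weight = weight      # frozen pre-trained weight, shape (out, in)
        self.bias = bias          # frozen pre-trained bias, shape (out,)
        self.rank = rank          # rank of the extra subspace (assumed value)
        self.rng = np.random.default_rng(seed)
        self.tasks = {}           # task name -> independent trainable parameters

    def add_task(self, name):
        """Register a new task without touching the frozen weights."""
        out_dim, in_dim = self.weight.shape
        self.tasks[name] = {
            "scale": np.ones(out_dim),    # rescales the frozen weight's outputs
            "shift": np.zeros(out_dim),   # shifts the frozen bias
            "down": self.rng.normal(0.0, 0.02, (self.rank, in_dim)),
            "up": np.zeros((out_dim, self.rank)),  # zero init: branch starts off
        }

    def forward(self, x, task=None):
        """x has shape (batch, in); task=None runs the unmodified frozen layer."""
        if task is None:
            return x @ self.weight.T + self.bias
        p = self.tasks[task]
        base = (x @ self.weight.T) * p["scale"] + self.bias + p["shift"]
        low_rank = (x @ p["down"].T) @ p["up"].T  # parallel low-rank branch
        return base + low_rank

    def num_task_params(self, name):
        return sum(v.size for v in self.tasks[name].values())


rng = np.random.default_rng(1)
layer = AffinerLinear(rng.normal(size=(8, 16)), np.zeros(8), rank=2)
layer.add_task("faces")     # hypothetical task names
layer.add_task("flowers")
x = rng.normal(size=(3, 16))
y_frozen = layer.forward(x)
y_faces = layer.forward(x, "faces")
```

In this toy setting each task adds 64 trainable values next to 136 frozen ones, and a freshly added task reproduces the frozen layer's output exactly. Because every task keeps its own parameter dictionary, updating one task leaves the others unchanged, which is the independence property the abstract relies on for serving several tasks from one model.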

Paper Structure

This paper contains 18 sections, 5 equations, 19 figures, 3 tables.

Figures (19)

  • Figure 1: Non-cherry picked Conditional comparisons
  • Figure 2: An illustration of the proposed Affiner block. We scale each weight matrix in the network and shift the trainable bias layer, which makes it possible to use any subspace of the learned matrix. To include additional subspaces, we add a parallel low-rank decomposition branch.
  • Figure 2: Non-cherry picked Conditional comparisons
  • Figure 3: Parameter-efficient fine-tuning on Transformer and CNN backbones. Both models are pre-trained on the ImageNet dataset and fine-tuned on the FFHQ and Flowers datasets, respectively.
  • Figure 3: Non-cherry picked Conditional comparisons
  • ...and 14 more figures
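
The Affiner block in the Figure 2 caption can be written compactly as follows. This is one plausible reading of the caption, not necessarily the paper's exact equation; the symbols $s_t$, $\delta_t$, $A_t$, $B_t$ and the rank $r$ are notation assumed here for illustration.

```latex
% Frozen layer:  y = W x + b,  with W \in \mathbb{R}^{d_{out} \times d_{in}}
% Affiner-style layer for task t (assumed notation):
y_t = \underbrace{\operatorname{diag}(s_t)\, W x}_{\text{scaled frozen weight}}
    + \underbrace{b + \delta_t}_{\text{shifted bias}}
    + \underbrace{B_t A_t x}_{\text{low-rank branch}},
\qquad
A_t \in \mathbb{R}^{r \times d_{in}},\;
B_t \in \mathbb{R}^{d_{out} \times r},\;
r \ll \min(d_{in}, d_{out})
```

Under this reading only $s_t$, $\delta_t$, $A_t$ and $B_t$ are trained for task $t$, which is $2 d_{out} + r\,(d_{in} + d_{out})$ values per layer, while $W$ and $b$ stay frozen and shared across tasks.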