TimeStep Master: Asymmetrical Mixture of Timestep LoRA Experts for Versatile and Efficient Diffusion Models in Vision

Shaobin Zhuang, Yiwei Guo, Yanbo Ding, Kunchang Li, Xinyuan Chen, Yaohui Wang, Fangyikang Wang, Ying Zhang, Chen Li, Yali Wang

TL;DR

Diffusion models enable high-quality vision generation but are costly to fine-tune for downstream tasks. TimeStep Master (TSM) mitigates this in two stages. In the fostering stage, it trains separate TimeStep LoRA experts, one per timestep interval. In the assembling stage, it combines them asymmetrically: an ungated core expert from the finest scale plus gated context experts from coarser scales, weighted by a timestep-conditioned router. This approach yields state-of-the-art results on domain adaptation, post-pretraining, and model distillation, and generalizes across UNet, DiT, and MM-DiT architectures and across image and video data while reducing fine-tuning cost. Notably, TSM achieves competitive FID with substantially fewer trainable parameters (often <1M) and fewer training days (e.g., about $3.7$ A100 days), attaining results such as FID $= 9.90$ on COCO2014 distillation. Overall, TSM provides a scalable, versatile framework for efficient diffusion-model fine-tuning.

Abstract

Diffusion models have driven the advancement of vision generation over the past years. However, it is often difficult to apply these large models to downstream tasks because of the massive fine-tuning cost. Recently, Low-Rank Adaptation (LoRA) has been applied for efficient tuning of diffusion models. Unfortunately, the capabilities of LoRA-tuned diffusion models are limited, since the same LoRA is used for all timesteps of the diffusion process. To tackle this problem, we introduce a general and concise TimeStep Master (TSM) paradigm with two key fine-tuning stages. In the fostering stage (stage 1), we apply different LoRAs to fine-tune the diffusion model at different timestep intervals. This results in different TimeStep LoRA experts that can effectively capture different noise levels. In the assembling stage (stage 2), we design a novel asymmetrical mixture of TimeStep LoRA experts, via core-context collaboration of experts at multi-scale intervals. For each timestep, we use the TimeStep LoRA expert within the smallest interval as the core expert without gating, and use the experts within the larger intervals as the context experts with time-dependent gating. Consequently, TSM can effectively model the noise level via the expert in the finest interval and adaptively integrate context from the experts at other scales, boosting the versatility of diffusion models. To show the effectiveness of the TSM paradigm, we conduct extensive experiments on three typical and popular LoRA-related tasks for diffusion models: domain adaptation, post-pretraining, and model distillation. TSM achieves state-of-the-art results on all these tasks, across various model structures (UNet, DiT, and MM-DiT) and visual data modalities (image, video), showing its remarkable generalization capacity.
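
The assembling stage described above can be illustrated with a minimal sketch for a single linear layer. This is not the authors' implementation. The class name `TimestepLoRAMixture`, the equal-width interval split, the sine/cosine timestep feature, and the softmax router are all assumptions made for illustration. Only the overall structure (an ungated core expert from the finest scale plus gated context experts from coarser scales) comes from the abstract.

```python
import numpy as np

def interval_index(t, T, n):
    """Index of the interval holding timestep t when 0..T-1 is split into n equal parts (assumed split)."""
    return min(t * n // T, n - 1)

class TimestepLoRAMixture:
    """Hypothetical asymmetrical mixture of timestep LoRA experts for one linear layer."""

    def __init__(self, d_in, d_out, rank, T, scales=(8, 4, 2, 1), seed=0):
        rng = np.random.default_rng(seed)
        self.T = T
        self.scales = scales  # scales[0] is the finest scale and supplies the core expert
        # One LoRA pair (A, B) per interval at every scale; B starts at zero as in standard LoRA.
        self.A = [rng.normal(0.0, 0.02, size=(n, rank, d_in)) for n in scales]
        self.B = [np.zeros((n, d_out, rank)) for n in scales]
        # Assumed router: a linear map from a 2-d timestep feature to one logit per context scale.
        self.router_w = rng.normal(0.0, 0.02, size=(len(scales) - 1, 2))

    def gates(self, t):
        """Timestep-dependent softmax weights over the context scales (assumed gating form)."""
        phase = np.pi * t / self.T
        logits = self.router_w @ np.array([np.sin(phase), np.cos(phase)])
        e = np.exp(logits - logits.max())
        return e / e.sum()

    def delta(self, x, t):
        """LoRA update added to the frozen layer's output for input x at timestep t."""
        k = interval_index(t, self.T, self.scales[0])
        out = self.B[0][k] @ (self.A[0][k] @ x)  # core expert: no gating
        g = self.gates(t)
        for s, n in enumerate(self.scales[1:], start=1):
            k = interval_index(t, self.T, n)
            out = out + g[s - 1] * (self.B[s][k] @ (self.A[s][k] @ x))  # gated context expert
        return out

mixture = TimestepLoRAMixture(d_in=4, d_out=3, rank=2, T=1000)
update = mixture.delta(np.ones(4), t=300)
print(update.shape)
```

In this sketch the asymmetry is the only structural difference from a symmetric mixture of experts: the finest-scale expert always contributes at full weight, and the router only decides how much the coarser experts add.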

Paper Structure

This paper contains 34 sections, 7 equations, 5 figures, 13 tables.

Figures (5)

  • Figure 1: Motivation Visualization. (a) During the generation process, the hidden states in the same block of the pre-trained PixArt-$\alpha$ change significantly with the timestep. (b) The pre-trained model and the LoRA-tuned model incorrectly generate a green bench and a red vase, while TSM corrects these errors. (c) The LoRA-tuned model generates degraded images, while TSM improves visual quality and text alignment.
  • Figure 2: Fostering Stage: TimeStep LoRA Expert Construction. We divide all $T$ timesteps into $n$ intervals and fine-tune the diffusion model with an individual LoRA module for each interval.
  • Figure 3: Assembling Stage: Asymmetrical Mixture of TimeStep LoRA Experts. We divide the $T$ timesteps at four scales, into $n_{1} = 8$, $n_{2} = 4$, $n_{3} = 2$, and $n_{4} = 1$ intervals. The TimeStep LoRA expert within the smallest-scale interval plays the core role, modeling the noise level of $t$ with fine granularity. The core expert (red) has no gating; the context experts (blue, yellow, and green) are gated. The router is timestep-dependent and adaptively weights the importance of the context experts at $t$.
  • Figure 4: Comparison on Video Modality. The videos generated by the LoRA-tuned model are not aligned with the prompts, while our TSM facilitates high-quality and consistent video generation.
  • Figure 5: Comparison on Model Distillation. The images generated by our TSM better align with the prompts, outperforming the vanilla LoRA, and even surpassing the teacher SD1.5 in some cases.
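
The interval layout in Figures 2 and 3 can be made concrete with a short sketch. It assumes $T = 1000$ timesteps and equal-width, half-open intervals; neither is stated in the captions, and the helper names `interval_bounds` and `sample_timestep` are hypothetical. The first helper lists the intervals for a given $n$, and the second draws a training timestep restricted to one expert's interval, as the fostering stage would need.

```python
import random

def interval_bounds(T, n):
    """Split timesteps 0..T-1 into n contiguous half-open intervals [start, end) (assumed equal width)."""
    return [(k * T // n, (k + 1) * T // n) for k in range(n)]

def sample_timestep(T, n, k, rng):
    """Draw a training timestep for expert k of an n-interval split."""
    start, end = interval_bounds(T, n)[k]
    return rng.randrange(start, end)

T = 1000  # assumed number of diffusion timesteps
t = 300   # example timestep

# For each scale in Figure 3, find the interval (and hence the expert) covering t.
for n in (8, 4, 2, 1):
    start, end = next(b for b in interval_bounds(T, n) if b[0] <= t < b[1])
    print(f"n={n}: expert for [{start}, {end})")
# n=8: expert for [250, 375)   (core expert)
# n=4: expert for [250, 500)
# n=2: expert for [0, 500)
# n=1: expert for [0, 1000)

rng = random.Random(0)
print(sample_timestep(T, 8, 2, rng))  # a timestep in [250, 375)
```

Under these assumptions, every timestep is covered by exactly one expert per scale, so four experts are active at any $t$ in the Figure 3 configuration.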