FINE: Factorizing Knowledge for Initialization of Variable-sized Diffusion Models

Yucheng Xie, Fu Feng, Ruixiao Shi, Jing Wang, Xin Geng

TL;DR

This work introduces FINE, a method based on the Learngene framework for initializing downstream networks from pre-trained models while accounting for both model size and task-specific requirements, and provides a comprehensive benchmark for learngene-based methods in image generation tasks.

Abstract

Diffusion models often converge slowly, and existing efficient training techniques, such as Parameter-Efficient Fine-Tuning (PEFT), are primarily designed for fine-tuning pre-trained models. However, these methods are limited in adapting models to variable sizes for real-world deployment, where no corresponding pre-trained models exist. To address this, we introduce FINE, a method based on the Learngene framework for initializing downstream networks from pre-trained models while accounting for both model size and task-specific requirements. FINE decomposes pre-trained knowledge into the product of matrices (i.e., $U$, $\Sigma$, and $V$), where $U$ and $V$ are shared across network blocks as "learngenes", and $\Sigma$ remains layer-specific. During initialization, FINE trains only $\Sigma$ using a small subset of data, while keeping the learngene parameters fixed, making it the first approach to integrate both size and task considerations in initialization. We provide a comprehensive benchmark for learngene-based methods in image generation tasks, and extensive experiments demonstrate that FINE consistently outperforms direct pre-training, particularly for smaller models, achieving state-of-the-art results across variable model sizes. FINE also offers significant computational and storage savings, reducing training steps by approximately $3N\times$ and storage by $5\times$, where $N$ is the number of models. Additionally, FINE's adaptability to tasks yields an average performance improvement of 4.29 and 3.30 in FID and sFID across multiple downstream datasets, highlighting its versatility and efficiency.

Paper Structure (24 sections, 6 equations, 2 figures, 3 tables)

This paper contains 24 sections, 6 equations, 2 figures, 3 tables.

Figures (2)

  • Figure 1: Can we decompose the knowledge in pre-trained models to extract size-independent components for effectively initializing models of various sizes when the original model is too large to deploy?
  • Figure 2: Framework of FINE: (a) The knowledge within a Diffusion Transformer (DiT) is initially decomposed into shared singular vectors, $U$ and $V$, along with layer-specific singular values, $\Sigma$, as described by Eq. (\ref{equ:svd}). This factorization captures the shared, size-agnostic components of the model (the learngenes), while maintaining layer-dependent variations through $\Sigma$. (b) During model initialization, only the singular values $\Sigma$ need to be adapted based on the size of the target model. These values are optimized using a small amount of data from the target task, while the learngenes, represented by the shared $U$ and $V$, remain fixed. This approach facilitates efficient task-specific and size-adaptive initialization.
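
The factorization described in Figure 2 can be illustrated with a small NumPy sketch. It is not the authors' implementation. It assumes one shared pair of matrices `U` and `V` with orthonormal columns and one vector of singular values per layer, so that each layer weight is `U diag(sigma_l) V^T`. The helper names (`orthonormal`, `build_weights`, `fit_sigmas`) are hypothetical. The toy least-squares objective stands in for training on a small subset of target-task data, and only the depth of the target model is varied here.

```python
import numpy as np


def orthonormal(rows, cols, rng):
    """Random matrix with orthonormal columns (stand-in for shared singular vectors)."""
    q, _ = np.linalg.qr(rng.standard_normal((rows, cols)))
    return q


def build_weights(U, sigmas, V):
    """Return W_l = U diag(sigma_l) V^T for each layer; U and V are shared."""
    return [(U * s) @ V.T for s in sigmas]


def fit_sigmas(U, V, targets, steps=200, lr=0.5):
    """Fit only the per-layer sigma vectors; U and V are never updated.

    Assumption: a toy objective 0.5 * ||U diag(sigma) V^T - T||_F^2 per layer
    replaces the task loss that would be used in practice.
    """
    sigmas = [np.ones(U.shape[1]) for _ in targets]
    for _ in range(steps):
        for l, T in enumerate(targets):
            residual = (U * sigmas[l]) @ V.T - T
            # Gradient w.r.t. sigma_k is u_k^T residual v_k.
            grad = np.einsum("ik,ij,jk->k", U, residual, V)
            sigmas[l] = sigmas[l] - lr * grad
    return sigmas


rng = np.random.default_rng(0)
U = orthonormal(8, 4, rng)  # shared across layers ("learngene")
V = orthonormal(6, 4, rng)  # shared across layers ("learngene")

# The same U and V initialize target models of different depths.
for depth in (2, 4):
    true_sigmas = [rng.uniform(0.5, 2.0, size=4) for _ in range(depth)]
    targets = build_weights(U, true_sigmas, V)
    sigmas = fit_sigmas(U, V, targets)
    weights = build_weights(U, sigmas, V)
    print(depth, len(weights), weights[0].shape)
```

In this sketch the only per-model trainable state is one length-4 vector per layer, which is why the shared matrices can be stored once and reused for every target size.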