DMM: Building a Versatile Image Generation Model via Distillation-Based Model Merging

Tianhui Song, Weixin Feng, Shuai Wang, Xubin Li, Tiezheng Ge, Bo Zheng, Limin Wang

TL;DR

This work addresses the redundancy and deployment burden caused by numerous specialized text-to-image diffusion checkpoints. It introduces DMM, a distillation-based model merging framework that uses a style-promptable student and a multi-teacher score-distillation objective to fuse diverse expert styles into one versatile diffusion model. The method employs three loss terms—score distillation, feature imitation, and multi-class adversarial loss—along with continual learning regularization to support incremental merging, and evaluates with a novel FID_t metric that tracks how well the merged model matches each teacher’s style distribution. Experimental results show that DMM achieves near-upper-bound performance on arbitrary-style generation, supports smooth style mixing, and remains compatible with downstream plugins, offering a scalable path to parameter-efficient, steerable T2I generation in real-world deployments.
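
As a rough illustration of how the three loss terms could combine, the objective might be written as below. This is a sketch under stated assumptions, not the paper's exact formulation: the symbols $\epsilon_\theta$ (student noise predictor), $\epsilon_{\phi_i}$ (frozen $i$-th teacher), $e_i$ (learnable style prompt for teacher $i$), and the weights $\lambda_\text{feat}$, $\lambda_\text{adv}$ are notation introduced here for illustration.

$$
\mathcal{L}_\text{score} = \mathbb{E}_{i,\,x_t,\,t,\,c}\left[\left\lVert \epsilon_\theta(x_t, t, c, e_i) - \epsilon_{\phi_i}(x_t, t, c) \right\rVert_2^2\right],
\qquad
\mathcal{L} = \mathcal{L}_\text{score} + \lambda_\text{feat}\,\mathcal{L}_\text{feat} + \lambda_\text{adv}\,\mathcal{L}_\text{adv}
$$

Under this reading, the style prompt $e_i$ tells the shared student which teacher it should imitate for a given sample, so a single set of weights can serve every style.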

Abstract

The success of text-to-image (T2I) generation models has spurred a proliferation of model checkpoints fine-tuned from the same base model on various specialized datasets. This flood of specialized models introduces high parameter redundancy and large storage costs, which calls for effective methods to consolidate the capabilities of diverse, powerful models into a single one. A common practice in model merging uses static linear interpolation in parameter space to achieve style mixing. However, it neglects a key feature of the T2I generation task: the many distinct models cover a wide range of styles, and naively combining them may lead to incompatibility and confusion in the merged model. To address this issue, we introduce a style-promptable image generation pipeline that can accurately generate arbitrary-style images under the control of style vectors. Based on this design, we propose a score-distillation-based model merging paradigm (DMM) that compresses multiple models into a single versatile T2I model. Moreover, we rethink and reformulate the model merging task in the context of T2I generation by presenting new merging goals and evaluation protocols. Our experiments demonstrate that DMM can compactly reorganize the knowledge from multiple teacher models and achieve controllable arbitrary-style generation.
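
For context, the static linear-interpolation baseline that the abstract contrasts against can be sketched in a few lines. This is a minimal illustration of that baseline, not of DMM itself; the function name `merge_linear` and the toy checkpoints (plain dictionaries of float lists standing in for real weight tensors) are hypothetical.

```python
from typing import Dict, List, Sequence

# Hypothetical stand-in for a real checkpoint: parameter name -> flat list of floats.
StateDict = Dict[str, List[float]]


def merge_linear(checkpoints: Sequence[StateDict], weights: Sequence[float]) -> StateDict:
    """Merge checkpoints by a fixed weighted average of their parameters."""
    if not checkpoints or len(checkpoints) != len(weights):
        raise ValueError("need one weight per checkpoint")
    if abs(sum(weights) - 1.0) > 1e-8:
        raise ValueError("interpolation weights should sum to 1")

    names = checkpoints[0].keys()
    for ckpt in checkpoints:
        # Interpolation only makes sense for models sharing one architecture.
        if ckpt.keys() != names:
            raise ValueError("checkpoints must have identical parameter names")

    merged: StateDict = {}
    for name in names:
        size = len(checkpoints[0][name])
        merged[name] = [
            sum(w * ckpt[name][i] for w, ckpt in zip(weights, checkpoints))
            for i in range(size)
        ]
    return merged


# Toy usage: two "experts" fine-tuned from the same base, mixed 50/50.
realistic = {"unet.w": [0.0, 2.0]}
anime = {"unet.w": [4.0, 6.0]}
mixed = merge_linear([realistic, anime], [0.5, 0.5])
```

Because the mixing weights are fixed at merge time, the resulting model produces one blended style; selecting or re-mixing styles at inference time is not possible without merging again, which is the limitation the style-promptable design is meant to address.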

Paper Structure

This paper contains 32 sections, 14 equations, 23 figures, 6 tables.

Figures (23)

  • Figure 1: Examples of image generation. Our DMM is able to generate images with various expert styles (realistic style, Asian portrait, anime style, etc.) under the control of style prompts.
  • Figure 2: Distributed Training Framework for DMM. (a) The model layout on a GPU cluster during training. Each node is assigned a specific teacher model, and all teachers jointly supervise a student model with shared parameters. A set of learnable embeddings (style prompts) is maintained to provide hints and to differentiate the teachers from each other. (b) Continual Learning. New teacher models are incorporated by initializing and adding new embeddings. The frozen pretrained student model serves as regularization, with style prompts randomly selected.
  • Figure 3: Style-promptable generation pipeline for distillation-based model merging. Our proposed distillation objective incorporates three loss terms: Score Distillation, Feature Imitation, and Multi-Class Adversarial Loss.
  • Figure 4: Heatmap of the FID matrix. The left one is the result ${\mathbf{M}}$ of our model, and the right one is the reference matrix ${\mathbf{M}}_\text{ref}$.
  • Figure 5: Visual generation results with different style selections. In each group, the first row shows our model's results, and the second row shows the corresponding results of the teacher models.
  • ...and 18 more figures