Diffusion Soup: Model Merging for Text-to-Image Diffusion Models

Benjamin Biggs, Arjun Seshadri, Yang Zou, Achin Jain, Aditya Golatkar, Yusheng Xie, Alessandro Achille, Ashwin Swaminathan, Stefano Soatto

TL;DR

Large diffusion models must absorb continual data updates, forget data on request, and avoid memorizing training samples. Diffusion Soup finetunes a separate checkpoint on each data shard and averages the checkpoint weights into a single souped model. Under a Taylor-linearization view, the souped model approximately samples from the geometric mean of the shard distributions, which improves generalization and limits memorization. The method enables training-free continual learning and unlearning, offers anti-memorization guarantees, and supports zero-shot style mixing. It often outperforms a monolithic paragon trained on the union of the shards and adds no inference cost. The approach offers a practical, scalable path to provenance-aware diffusion modeling, with strong empirical results on domain-sharded and aesthetic data.

Abstract

We present Diffusion Soup, a compartmentalization method for Text-to-Image Generation that averages the weights of diffusion models trained on sharded data. By construction, our approach enables training-free continual learning and unlearning with no additional memory or inference costs, since models corresponding to data shards can be added or removed by re-averaging. We show that Diffusion Soup samples from a point in weight space that approximates the geometric mean of the distributions of constituent datasets, which offers anti-memorization guarantees and enables zero-shot style mixing. Empirically, Diffusion Soup outperforms a paragon model trained on the union of all data shards and achieves a 30% improvement in Image Reward (.34 $\to$ .44) on domain sharded data, and a 59% improvement in IR (.37 $\to$ .59) on aesthetic data. In both cases, souping also prevails in TIFA score (respectively, 85.5 $\to$ 86.5 and 85.6 $\to$ 86.8). We demonstrate robust unlearning -- removing any individual domain shard only lowers performance by 1% in IR (.45 $\to$ .44) -- and validate our theoretical insights on anti-memorization using real data. Finally, we showcase Diffusion Soup's ability to blend the distinct styles of models finetuned on different shards, resulting in the zero-shot generation of hybrid styles.
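The add-and-remove-by-re-averaging idea described in the abstract can be sketched in a few lines. This is a minimal illustration rather than the authors' implementation: the checkpoints are toy dictionaries of flattened parameter lists, and the shard names, the `soup` helper, and the values are all hypothetical.

```python
from typing import Dict, List

# Toy stand-in for a checkpoint: parameter name -> flattened values.
# (Assumption: real checkpoints would be tensors with identical shapes.)
Weights = Dict[str, List[float]]


def soup(checkpoints: Dict[str, Weights]) -> Weights:
    """Uniformly average the parameters of all shard checkpoints."""
    if not checkpoints:
        raise ValueError("need at least one checkpoint")
    models = list(checkpoints.values())
    n = len(models)
    return {
        name: [sum(vals) / n for vals in zip(*(m[name] for m in models))]
        for name in models[0]
    }


# Hypothetical shard names with made-up weights.
shards = {
    "fashion": {"w": [1.0, 2.0], "b": [0.0]},
    "food": {"w": [3.0, 7.0], "b": [3.0]},
    "animals": {"w": [5.0, 6.0], "b": [6.0]},
}

# Continual learning: add a shard's checkpoint to the dict and re-average.
souped = soup(shards)

# Unlearning: drop the shard's checkpoint and re-average the rest.
remaining = {name: w for name, w in shards.items() if name != "food"}
souped_without_food = soup(remaining)
```

Because the result is a single set of weights with the same shape as any one checkpoint, inference cost does not change however many shards are averaged.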

Paper Structure

This paper contains 32 sections, 6 theorems, 12 equations, 9 figures, and 4 tables.

Key Result

Proposition

(Geometric Mean) Let $\nabla_{x_t} \log p^{(i)}(x_t)$ be the marginal score in the backward diffusion equation, where $x_t = \gamma_t x_0 + \sigma_t \epsilon$, with $x_0 \sim p^{(i)}(x_0)$ and $\epsilon \sim \mathcal{N}(0, I)$. Let $\epsilon_{w_i}(x_t, t, y)$ be a neural network with sufficient capacity traine
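The link between averaging and the geometric mean rests on a standard identity: the average of the scores $\nabla_x \log p^{(i)}(x)$ equals the score of the normalized geometric mean $\prod_i p^{(i)}(x)^{1/n}$. The sketch below checks this numerically for 1-D Gaussians, where the geometric mean is again Gaussian. It illustrates only the score-level identity, not the weight-space linearization argument; the helper names and the example parameters are assumptions made for illustration.

```python
import math


def gaussian_score(x: float, mu: float, sigma: float) -> float:
    """Score d/dx log N(x; mu, sigma^2)."""
    return -(x - mu) / sigma**2


def geometric_mean_params(params):
    """Mean and std of the normalized geometric mean of 1-D Gaussians.

    The geometric mean has precision equal to the average precision and a
    precision-weighted mean.
    """
    n = len(params)
    precision = sum(1.0 / s**2 for _, s in params) / n
    mean = sum(m / s**2 for m, s in params) / n / precision
    return mean, 1.0 / math.sqrt(precision)


def average_score(x: float, params) -> float:
    """Uniform average of the component scores at x."""
    return sum(gaussian_score(x, m, s) for m, s in params) / len(params)


# Two hypothetical shard distributions, given as (mean, std) pairs.
params = [(-2.0, 1.0), (4.0, 2.0)]
gm_mu, gm_sigma = geometric_mean_params(params)

# The averaged score matches the geometric-mean score at every test point.
for x in (-3.0, 0.0, 2.5):
    assert math.isclose(
        average_score(x, params),
        gaussian_score(x, gm_mu, gm_sigma),
        abs_tol=1e-12,
    )
```

Note that the geometric mean places its mass where the component densities overlap, so a sample with high density under only one shard is down-weighted. This is the intuition behind the anti-memorization claim.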

Figures (9)

  • Figure 1: Diffusion Soup Enables Three Distinct Applications. (1) Continual Learning & Unlearning: models trained on various data shards can be added to improve performance or subtracted when removal is necessary. (2) Zero-Shot Style Mixing: souping blends the styles of its components into a hybrid with no extra supervision. (3) Anti-Memorization: Diffusion Soup avoids memorizing training images while capturing their high-level style (note that we blur depictions of inputs in this subfigure).
  • Figure 2: Images from Diffusion Soup Beat SD2.1 and a Combined Paragon. We visualize images generated by averaging the weights of various models finetuned on data shards spanning different categories (Souped), and compare them to images from the pretrained model (SD2.1) and a paragon model trained on all the shards (Combined). These images highlight Diffusion Soup's dominance in metrics for Text-Image Alignment, Aesthetics, and Fidelity (see the mixture results table). Best viewed in color and zoomed in.
  • Figure 3: Finetuning Models on Category-Specific Data Subsets Produces Specialists. We visualize results from finetuning SD2.1 models on various data subsets sharded by category. The finetuning process specializes the diffusion model to the subset's category: for example, the Body Parts dataset enhances the prominence of fingers, and the Fashion Clothes dataset enhances outfits. These models can be souped together to obtain generalized models that outperform all specialists (see the mixture results table).
  • Figure 4: Removing Any Individual Data Shard Does Not Meaningfully Reduce Performance. We soup specialists, leaving one out at a time, to demonstrate that no individual specialist significantly affects the quality of the generalist. Our results show that Diffusion Soup can be used for Machine Unlearning. From left to right, the graphs show TIFA, Image Reward, and CLIP Score. The performance of the uniform soup model is shown in green, and that of SD2.1 in red.
  • Figure 5: Diffusion Soup merges finetuned models to create hybrid styles. We apply Diffusion Soup to models finetuned on Pokemon (Row 1) and FS-COCO (Row 2) to create a hybrid style (Row 3). The results are zero-shot since we do not have examples of the hybrid style for training.
  • ...and 4 more figures

Theorems & Definitions (6)

  • 6 propositions, including the Geometric Mean proposition shown under Key Result