ScoreFusion: Fusing Score-based Generative Models via Kullback-Leibler Barycenters

Hao Liu, Junze Tony Ye, Jose Blanchet, Nian Si

TL;DR

ScoreFusion tackles fusing multiple pre-trained diffusion models to represent a target distribution with limited data by grounding fusion in KL barycenters. It learns barycenter weights $\boldsymbol{\lambda}$ on the simplex via score matching in diffusion, yielding a parametric family with density $p_{\boldsymbol{\lambda}}(x) \propto \prod_i p_i(x)^{\lambda_i}$. The authors prove dimension-free convergence guarantees and present two fusion schemes, vanilla KL fusion and ScoreFusion, demonstrating improved data efficiency on MNIST calibration and enhanced population heterogeneity in portrait sampling. The work offers a principled alternative to naive checkpoint merging and extends naturally to other gradient-flow diffusion settings.
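To make the density formula concrete, here is a minimal sketch for Gaussian auxiliary densities, where the weighted geometric mean has a closed form: precisions add with weights $\lambda_i$. It also checks the identity that follows from taking the log-gradient of the formula, namely that the barycenter score is the $\boldsymbol{\lambda}$-weighted sum of the auxiliary scores. The function names and the numbers are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def gaussian_kl_barycenter(means, covs, lam):
    """Mean and covariance of p_lam ∝ prod_i N(x; m_i, S_i)^{lam_i}, lam on the simplex."""
    precisions = [np.linalg.inv(S) for S in covs]
    # Raising a Gaussian to the power lam_i scales its precision by lam_i,
    # so the product has precision sum_i lam_i * P_i.
    P = sum(l * Pi for l, Pi in zip(lam, precisions))
    h = sum(l * Pi @ m for l, Pi, m in zip(lam, precisions, means))
    cov = np.linalg.inv(P)
    return cov @ h, cov

def gaussian_score(x, mean, cov):
    """Score (gradient of the log-density) of N(mean, cov) at x."""
    return -np.linalg.solve(cov, x - mean)

# Illustrative auxiliary densities and weights (assumed values).
means = [np.array([0.0, 0.0]), np.array([2.0, 1.0])]
covs = [np.eye(2), np.array([[2.0, 0.3], [0.3, 0.5]])]
lam = np.array([0.3, 0.7])

bary_mean, bary_cov = gaussian_kl_barycenter(means, covs, lam)

# Score of the barycenter vs. weighted sum of auxiliary scores at a test point.
x = np.array([0.5, -1.0])
fused = sum(l * gaussian_score(x, m, S) for l, m, S in zip(lam, means, covs))
direct = gaussian_score(x, bary_mean, bary_cov)
```

Here `fused` and `direct` agree up to floating-point error, which is the data-space version of the score-level fusion the summary describes.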

Abstract

We introduce ScoreFusion, a theoretically grounded method for fusing multiple pre-trained diffusion models that are assumed to generate from auxiliary populations. ScoreFusion is particularly useful for enhancing the generative modeling of a target population with limited observed data. Our starting point considers the family of KL barycenters of the auxiliary populations, which is proven to be an optimal parametric class in the KL sense, but difficult to learn. Nevertheless, by recasting the learning problem as score matching in denoising diffusion, we obtain a tractable way of computing the optimal KL barycenter weights. We prove a dimension-free sample complexity bound in total variation distance, provided that the auxiliary models are well-fitted for their own task and the auxiliary tasks combined capture the target well. The sample efficiency of ScoreFusion is demonstrated by learning handwritten digits. We also provide a simple adaptation of a Stable Diffusion denoising pipeline that enables sampling from the KL barycenter of two auxiliary checkpoints; on a portrait generation task, our method produces faces that enhance population heterogeneity relative to the auxiliary distributions.
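The idea of learning the barycenter weights by score matching can be illustrated with a toy sketch. It is not the authors' algorithm: closed-form scores of two one-dimensional Gaussians stand in for the frozen pre-trained score networks, the noising scheme, the softmax parametrization of $\boldsymbol{\lambda}$, the $\sigma^2$ loss weighting, and plain gradient descent are all assumptions made for illustration. In this equal-variance toy case the weighted sum of noised auxiliary scores is exactly the score of the noised barycenter, so the denoising score matching loss is minimized at the true weights.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical auxiliary models: N(m_i, 1), whose noised marginals are N(m_i, 1 + sigma^2).
aux_means = np.array([-2.0, 2.0])

def aux_scores(x_t, sigma):
    """Stack the k frozen auxiliary scores at noisy points x_t; shape (n, k)."""
    return -(x_t[:, None] - aux_means[None, :]) / (1.0 + sigma[:, None] ** 2)

def softmax(theta):
    z = np.exp(theta - theta.max())
    return z / z.sum()

# Target sample drawn from the barycenter with weights (0.3, 0.7), i.e. N(0.8, 1).
true_lam = np.array([0.3, 0.7])
x0 = rng.normal(true_lam @ aux_means, 1.0, size=4000)

# Forward noising x_t = x_0 + sigma * eps at a few noise levels.
sigma = rng.choice([0.5, 1.0, 2.0], size=x0.shape[0])
eps = rng.normal(size=x0.shape[0])
x_t = x0 + sigma * eps

S = aux_scores(x_t, sigma)   # (n, k) auxiliary scores, never updated
y = -eps / sigma             # denoising score matching regression target
w = sigma ** 2               # sigma^2 weighting of the squared error

theta = np.zeros(2)          # lam = softmax(theta) stays on the simplex
for _ in range(2000):
    lam = softmax(theta)
    resid = S @ lam - y
    grad_lam = 2.0 * np.mean((w * resid)[:, None] * S, axis=0)
    grad_theta = lam * (grad_lam - lam @ grad_lam)  # chain rule through softmax
    theta -= 0.5 * grad_theta

lam_hat = softmax(theta)     # estimated barycenter weights
```

Only the $k$ weights are fitted while the auxiliary scores stay fixed, which is the intuition behind the data efficiency claimed for targets with few observations.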

Paper Structure

This paper contains 49 sections, 14 theorems, 103 equations, 21 figures, 5 tables, 2 algorithms.

Key Result

Theorem 1

Suppose $\{\mu_1, \ldots, \mu_k\} \subset \mathcal{P}(\mathbb{R}^d)$ and for each $i = 1, \ldots, k$, $\mu_i$ is absolutely continuous with respect to the Lebesgue measure, with densities $p_1, \ldots, p_k$ respectively. Then, the distribution-level KL barycenter $\mu_{\boldsymbol{\lambda}}$ is uniq

Figures (21)

  • Figure 1: Top: Generations by the 1st auxiliary model alone, resembling the White male phenotype. Bottom: Generations by the 2nd auxiliary model, resembling the Asian female phenotype. These are model generations without using the KL barycenter sampler.
  • Figure 2: Left: i.i.d. samples from KL barycenter. Right: i.i.d. samples from checkpoint merging. Interpolation weights are $\boldsymbol{\lambda} = (0.5, 0.5)$ for both. The same text prompt as Figure 1 was used: "a photo of a mathematics scientist, looking at the camera, ultra quality, sharp focus". Both approaches enhance ethnic diversity relative to the monolithic representations in Figure 1, but the KL barycenter approach also produces samples that embody a more ambiguous and rarer representation of gender and ethnicity, mitigating stereotypes.
  • Figure 3: Top row: KL barycenter. Bottom row: Checkpoint merging. The same Gaussian noise was used to seed all twelve images, the only difference being the interpolation approach (top vs bottom) and interpolation weight; from left to right, $\lambda_2 \in \{0, 0.2, 0.4, 0.6, 0.8, 1.0\}$ and $\lambda_1 = 1 - \lambda_2$. $\lambda_2=0$ and $\lambda_2=1$ each reduce to an original auxiliary (biased) SDXL model. Observe that the bottom row samples show an abrupt identity shift between $\lambda_2 = 0.2$ and $0.4$, whereas the top row shows a smoother transition from one demographic visual concept to another.
  • Figure 4: Empirical distribution of each model, projected onto a 2D gender semantic space. The $x$ and $y$ coordinates are CLIP scores. $\mathbf{E}_{img}$ are the CLIP embeddings of each sample. $\mathbf{E}_{text}^{F}$ and $\mathbf{E}_{text}^M$ are embeddings of "a photo of a female scientist" and "a photo of a male scientist". Gender neutrality and diversity can be interpreted as the middle region between the two auxiliary models' unimodal clusters; the KL barycenter samples from this unexplored region, whereas checkpoint merging induces a bimodal mixture distribution.
  • Figure 5: KDE contours of the KL barycenter distribution under various $\lambda_2$ (denoted as $\lambda$ in legends) values, estimated using a bandwidth of $0.8$. Left: $\mathbf{E}_{text}^{F}, \mathbf{E}_{text}^M$ are text embeddings of "a photo of a female scientist" and "a photo of a male scientist". Right: $\mathbf{E}_{text}^{EA}, \mathbf{E}_{text}^{W}$ are text embeddings of "a photo of an East Asian scientist" and "a photo of a White scientist".
  • ...and 16 more figures

Theorems & Definitions (31)

  • Theorem 1
  • Theorem 2
  • Proposition 1
  • Theorem 3
  • Theorem 4
  • Definition 1
  • Theorem 5
  • proof
  • Remark 1
  • Lemma 1
  • ...and 21 more