Breaking the Likelihood-Quality Trade-off in Diffusion Models by Merging Pretrained Experts

Yasin Esfandiari, Stefan Bauer, Sebastian U. Stich, Andrea Dittadi

TL;DR

This work tackles the persistent trade-off between perceptual sample quality and data likelihood in diffusion models. It introduces a plug-and-play approach that merges two pretrained diffusion experts by hard-switching between them along the denoising trajectory at a switching threshold $\eta$, using an image-quality expert at high noise and a likelihood-focused expert at low noise, without retraining. Across CIFAR-10 and ImageNet32, the merged model consistently matches or surpasses both base experts on FID and likelihood, demonstrating that noise-level-informed switching can break the apparent trade-off. The method is modular and data-efficient because it relies only on existing pretrained models. Future directions include automated switching, integration with advanced samplers, and extensions to latent or consistency-based diffusion variants.

Abstract

Diffusion models for image generation often exhibit a trade-off between perceptual sample quality and data likelihood: training objectives emphasizing high-noise denoising steps yield realistic images but poor likelihoods, whereas likelihood-oriented training overweights low-noise steps and harms visual fidelity. We introduce a simple plug-and-play sampling method that combines two pretrained diffusion experts by switching between them along the denoising trajectory. Specifically, we apply an image-quality expert at high noise levels to shape global structure, then switch to a likelihood expert at low noise levels to refine pixel statistics. The approach requires no retraining or fine-tuning -- only the choice of an intermediate switching step. On CIFAR-10 and ImageNet32, the merged model consistently matches or outperforms its base components, improving or preserving both likelihood and sample quality relative to each expert alone. These results demonstrate that expert switching across noise levels is an effective way to break the likelihood-quality trade-off in image diffusion models.
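The hard switch described above can be sketched in a few lines. This is a minimal illustration rather than the authors' implementation: the names `merge_experts` and `expert_schedule` are hypothetical, plain floats stand in for image tensors, and the time convention ($t = 1$ is pure noise, $t = 0$ is data, quality expert for $t \geq \eta$, likelihood expert for $t < \eta$) is taken from the figure captions. How the boundary $t = \eta$ is handled is a choice made here.

```python
from typing import Callable, List

# A denoiser takes a noisy sample z_t and a time t in [0, 1] and returns a prediction.
# Plain floats stand in for image tensors so the sketch runs without dependencies.
Denoiser = Callable[[float, float], float]


def merge_experts(quality_expert: Denoiser,
                  likelihood_expert: Denoiser,
                  eta: float) -> Denoiser:
    """Return a denoiser that hard-switches between two experts at time eta.

    Assumed convention: t = 1 is pure noise, t = 0 is data; the quality expert
    handles t >= eta and the likelihood expert handles t < eta.
    """
    if not 0.0 <= eta <= 1.0:
        raise ValueError("eta must lie in [0, 1]")

    def merged(z: float, t: float) -> float:
        # Route the call by noise level only; neither expert is modified or retrained.
        expert = likelihood_expert if t < eta else quality_expert
        return expert(z, t)

    return merged


def expert_schedule(eta: float, num_steps: int) -> List[str]:
    """Label which expert is queried at each time on a uniform grid from t = 1 towards 0."""
    times = [1.0 - i / num_steps for i in range(num_steps)]
    return ["likelihood" if t < eta else "quality" for t in times]
```

Because the merged object has the same call signature as either expert, it could in principle be passed to any sampler that accepts a single denoiser. With $\eta = 0$ every call goes to the quality expert, so the sketch reduces to that base model.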

Paper Structure

This paper contains 15 sections, 17 equations, 9 figures, 5 tables.

Figures (9)

  • Figure 1: Diagram of our merged model where at an intermediate time $\eta \in [0,1]$ we switch between denoisers. Note that the likelihood model is only used for almost imperceptible noise levels. This significantly improves the likelihood, which is sensitive to low-level color statistics, while leaving the FID unaffected.
  • Figure 2: Likelihood--quality trade-off on CIFAR-10. Likelihood is measured in BPD using PF ODE integration with truncated normal dequantization (ODE (TN)) or the variational lower bound (VLB). Perceptual quality is measured with FID with both deterministic (ODE sampler) and stochastic (VDM ancestral) integration. The x-axis corresponds to the switching threshold $\eta$ between the models. The EDM and VDM base models correspond to $\eta = 0$ and $\eta = 1$, respectively.
  • Figure 3: Qualitative comparison on CIFAR-10 using the ODE sampler. Each row starts from the same noise sample $\mathbf{z}_1$, while columns vary the threshold $\eta$ in the merged model. The model follows EDM dynamics from $t = 1$ down to $t = \eta$ and then switches to VDM for $t<\eta$. Increasing $\eta$ triggers an earlier switch, improving likelihood but gradually reducing perceptual fidelity.
  • Figure 4: Likelihood--quality trade-off on ImageNet32. Likelihood is measured in BPD using PF ODE integration with truncated normal dequantization (ODE (TN)) or the variational lower bound (VLB). Perceptual quality is measured with FID with both deterministic (ODE sampler) and stochastic (VDM ancestral) integration. The x-axis corresponds to the switching threshold $\eta$ between the models. The EDM and VDM base models correspond to $\eta = 0$ and $\eta = 1$, respectively.
  • Figure 5: Generated images from our merged model using different thresholds $\eta$ on the ImageNet32 dataset.
  • ...and 4 more figures
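
Figures 2 and 4 report likelihood in bits per dimension (BPD). As a reminder of the unit, the sketch below converts a negative log-likelihood in nats into BPD by dividing by the number of data dimensions and by $\ln 2$. The function name `nats_to_bpd` is hypothetical, and the dimension count $3 \times 32 \times 32 = 3072$ assumes 32×32 RGB images as in CIFAR-10 and ImageNet32; no values from the paper are reproduced here.

```python
import math


def nats_to_bpd(nll_nats: float, num_dims: int) -> float:
    """Convert a per-sample negative log-likelihood in nats to bits per dimension."""
    if num_dims <= 0:
        raise ValueError("num_dims must be positive")
    # Divide by ln(2) to change nats into bits, then average over the dimensions.
    return nll_nats / (num_dims * math.log(2.0))


# Assumed dimensionality of a 32x32 RGB image.
DIMS_32X32_RGB = 3 * 32 * 32
```

Lower BPD means higher likelihood, so "improving likelihood" in the captions corresponds to curves moving down on the BPD axis.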