Breaking the Likelihood-Quality Trade-off in Diffusion Models by Merging Pretrained Experts
Yasin Esfandiari, Stefan Bauer, Sebastian U. Stich, Andrea Dittadi
TL;DR
This work tackles the persistent trade-off between perceptual sample quality and data likelihood in diffusion models. It introduces a plug-and-play approach that merges two pretrained diffusion experts, without retraining, by hard-switching between them along the denoising trajectory at a switching threshold $\eta$: an image-quality expert handles the high-noise steps and a likelihood-focused expert handles the low-noise steps. On CIFAR-10 and ImageNet32, the merged model consistently matches or surpasses both base experts on FID and likelihood, showing that noise-level-informed switching can break the apparent trade-off. The method is modular and data-efficient, relying on existing pretrained models. Future directions include automated switching, integration with advanced samplers, and extensions to latent or consistency-based diffusion variants.
Abstract
Diffusion models for image generation often exhibit a trade-off between perceptual sample quality and data likelihood: training objectives emphasizing high-noise denoising steps yield realistic images but poor likelihoods, whereas likelihood-oriented training overweights low-noise steps and harms visual fidelity. We introduce a simple plug-and-play sampling method that combines two pretrained diffusion experts by switching between them along the denoising trajectory. Specifically, we apply an image-quality expert at high noise levels to shape global structure, then switch to a likelihood expert at low noise levels to refine pixel statistics. The approach requires no retraining or fine-tuning -- only the choice of an intermediate switching step. On CIFAR-10 and ImageNet32, the merged model consistently matches or outperforms its base components, improving or preserving both likelihood and sample quality relative to each expert alone. These results demonstrate that expert switching across noise levels is an effective way to break the likelihood-quality trade-off in image diffusion models.
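
The switching rule described above can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes a noise level t that decreases from 1 (pure noise) to 0 (data), experts exposed as functions that predict the clean image from (x, t), and a simple deterministic update step. The names merge_experts, sample, quality_expert, and likelihood_expert are hypothetical, and the toy experts in the usage lines stand in for real pretrained networks.

```python
import numpy as np

def merge_experts(quality_expert, likelihood_expert, eta):
    """Build a denoiser that hard-switches between two experts at threshold eta."""
    def denoise(x, t):
        # Assumed convention: large t means high noise.
        # High noise (t > eta): image-quality expert shapes global structure.
        # Low noise (t <= eta): likelihood expert refines pixel statistics.
        expert = quality_expert if t > eta else likelihood_expert
        return expert(x, t)
    return denoise

def sample(denoise, x_init, timesteps):
    """Generic deterministic sampling loop (an assumed sampler, for illustration)."""
    x = x_init
    for t_cur, t_next in zip(timesteps[:-1], timesteps[1:]):
        x0_hat = denoise(x, t_cur)            # predicted clean image
        eps_hat = (x - x0_hat) / t_cur        # implied noise direction
        x = x0_hat + t_next * eps_hat         # move to the next noise level
    return x

# Toy usage: placeholder "experts" that ignore their input.
toy_quality = lambda x, t: np.zeros_like(x)
toy_likelihood = lambda x, t: np.ones_like(x)
denoiser = merge_experts(toy_quality, toy_likelihood, eta=0.45)
result = sample(denoiser, np.random.randn(4, 4), np.linspace(1.0, 0.0, 11))
```

Because the switch happens inside the denoiser, the sampling loop itself does not change, which is the sense in which the approach is plug-and-play: only the threshold eta has to be chosen.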
