Model-Agnostic Human Preference Inversion in Diffusion Models
Jeeyung Kim, Ze Wang, Qiang Qiu
TL;DR
To address the high inference cost of diffusion models, the paper targets one-step generation (sampling steps $L=1$), where the initial noise $x_T$ heavily shapes the output. It proposes Prompt Adaptive Human Preference Inversion (PAHI), a lightweight, model-agnostic framework that learns the initial noise through a global Gaussian prior and tailors it to each prompt with a noise-predicting model. Using human-preference scorers such as PickScore and ImageReward, the authors demonstrate that optimizing the noise prior yields substantial image-quality gains with only marginal compute overhead and a modest increase of about 5M parameters. The work underscores the pivotal role of the noise prior in diffusion sampling and offers a practical path to efficient, high-quality text-to-image synthesis without fine-tuning the diffusion model.
Abstract
Efficient text-to-image generation remains challenging because of the high computational cost of multi-step sampling in diffusion models. Although distilling pre-trained diffusion models has successfully reduced the number of sampling steps, low-step image generation often falls short in quality. In this study, we propose a novel sampling design that achieves high-quality one-step image generation aligned with human preferences, focusing on the impact of the prior noise distribution. Our approach, Prompt Adaptive Human Preference Inversion (PAHI), optimizes the noise distribution for each prompt based on human preferences, without fine-tuning the diffusion model. Our experiments show that the tailored noise distributions significantly improve image quality with only a marginal increase in computational cost. Our findings underscore the importance of noise optimization and pave the way for efficient, high-quality text-to-image synthesis.
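
As a rough illustration of what a prompt-adapted noise prior could look like, the sketch below draws an initial noise vector $x_T$ from a learned Gaussian and shifts it by a prompt-specific term. The summary above does not say how the global prior and the prompt-specific prediction are combined, so the additive form is an assumption. The names `sample_initial_noise`, `mu`, `log_sigma`, and `prompt_offset` are hypothetical and not taken from the paper. The human-preference scoring and the optimization of these parameters are not shown.

```python
import math
import random


def sample_initial_noise(mu, log_sigma, prompt_offset, rng):
    """Draw x_T = mu + sigma * eps + prompt_offset, with eps ~ N(0, I).

    mu, log_sigma: parameters of a hypothetical learned global Gaussian prior.
    prompt_offset: hypothetical output of a prompt-conditioned noise predictor.
    The additive combination is an assumption made for illustration only.
    """
    if not (len(mu) == len(log_sigma) == len(prompt_offset)):
        raise ValueError("mu, log_sigma and prompt_offset must have equal length")
    return [
        m + math.exp(s) * rng.gauss(0.0, 1.0) + d
        for m, s, d in zip(mu, log_sigma, prompt_offset)
    ]


# With mu = 0, log_sigma = 0 and a zero offset, this reduces to the
# standard N(0, I) prior used by an unmodified diffusion sampler.
rng = random.Random(0)
dim = 4
x_T = sample_initial_noise([0.0] * dim, [0.0] * dim, [0.0] * dim, rng)
```

Under these assumptions, the diffusion model itself is untouched: only the vector handed to the one-step sampler changes.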
