Model-Agnostic Human Preference Inversion in Diffusion Models

Jeeyung Kim, Ze Wang, Qiang Qiu

TL;DR

To address the high inference cost of diffusion models, the paper targets one-step generation ($L=1$), where the initial noise $x_T$ heavily shapes the output. It proposes Prompt Adaptive Human Preference Inversion (PAHI), a lightweight, model-agnostic framework that learns and tailors the initial noise via a global Gaussian prior and a prompt-specific noise-predicting model. Using human-preference scorers such as PickScore and ImageReward, the authors demonstrate that optimizing the noise prior yields substantial image-quality gains with only marginal compute overhead and a modest parameter increase (~5M). This work underscores the pivotal role of the noise prior in diffusion sampling and offers a practical pathway to efficient, high-quality text-to-image synthesis without diffusion-model fine-tuning.

Abstract

Efficient text-to-image generation remains a challenging task due to the high computational costs associated with the multi-step sampling in diffusion models. Although distillation of pre-trained diffusion models has been successful in reducing sampling steps, low-step image generation often falls short in terms of quality. In this study, we propose a novel sampling design to achieve high-quality one-step image generation aligning with human preferences, particularly focusing on exploring the impact of the prior noise distribution. Our approach, Prompt Adaptive Human Preference Inversion (PAHI), optimizes the noise distributions for each prompt based on human preferences without the need for fine-tuning diffusion models. Our experiments showcase that the tailored noise distributions significantly improve image quality with only a marginal increase in computational cost. Our findings underscore the importance of noise optimization and pave the way for efficient and high-quality text-to-image synthesis.
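The idea of learning a Gaussian noise prior against a preference score, while leaving the generator untouched, can be illustrated with a toy sketch. Everything below is an assumption made for illustration and is not the authors' implementation: `generate` is a hypothetical frozen elementwise affine map standing in for a one-step diffusion model, `reward` is a hypothetical differentiable score standing in for PickScore or ImageReward, and only the global prior (mean and log standard deviation) is learned, not the prompt-specific noise-predicting model. Gradients flow to the prior through the reparameterization $x = \mu + \sigma \epsilon$.

```python
import math
import random

# Hypothetical frozen "one-step generator": an elementwise affine map.
W = [1.5, -0.8, 0.5]
B = [0.2, 0.0, -0.3]
# Hypothetical output that the toy reward prefers.
TARGET = [1.0, 1.0, 1.0]


def generate(x):
    """Frozen toy generator; its parameters are never updated."""
    return [w * xi + b for w, xi, b in zip(W, x, B)]


def reward(y):
    """Toy differentiable reward: negative squared distance to TARGET."""
    return -sum((yi - ti) ** 2 for yi, ti in zip(y, TARGET))


def sample_noise(mu, log_sigma, rng):
    """Reparameterized sample x = mu + sigma * eps, with eps ~ N(0, I)."""
    eps = [rng.gauss(0.0, 1.0) for _ in mu]
    x = [m + math.exp(s) * e for m, s, e in zip(mu, log_sigma, eps)]
    return x, eps


def mean_reward(mu, log_sigma, rng, n=2000):
    """Monte Carlo estimate of the expected reward under the noise prior."""
    total = 0.0
    for _ in range(n):
        x, _ = sample_noise(mu, log_sigma, rng)
        total += reward(generate(x))
    return total / n


def fit_prior(steps=300, batch=32, lr=0.05, seed=0):
    """Gradient ascent on expected reward w.r.t. the prior parameters only."""
    rng = random.Random(seed)
    dim = len(W)
    mu = [0.0] * dim          # start from the standard Gaussian N(0, I)
    log_sigma = [0.0] * dim
    for _ in range(steps):
        g_mu = [0.0] * dim
        g_ls = [0.0] * dim
        for _ in range(batch):
            x, eps = sample_noise(mu, log_sigma, rng)
            y = generate(x)
            for i in range(dim):
                # d reward / d x_i, chained through the frozen generator
                dr_dx = -2.0 * (y[i] - TARGET[i]) * W[i]
                # d x_i / d mu_i = 1
                g_mu[i] += dr_dx / batch
                # d x_i / d log_sigma_i = sigma_i * eps_i
                g_ls[i] += dr_dx * math.exp(log_sigma[i]) * eps[i] / batch
        mu = [m + lr * g for m, g in zip(mu, g_mu)]
        log_sigma = [s + lr * g for s, g in zip(log_sigma, g_ls)]
    return mu, log_sigma


if __name__ == "__main__":
    mu, log_sigma = fit_prior()
    rng = random.Random(1)
    print("standard prior:", mean_reward([0.0] * 3, [0.0] * 3, rng))
    print("learned prior: ", mean_reward(mu, log_sigma, rng))
```

In this toy setting the reward has a single optimum, so the learned variance shrinks toward zero; that is a property of the stand-in reward, not a claim about the paper's results. A prompt-adaptive variant would replace the constant `mu` and `log_sigma` with the outputs of a small network conditioned on the prompt embedding.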

Paper Structure

This paper contains 8 sections, 7 equations, 1 figure, and 2 tables.

Figures (1)

  • Figure 1: User-written prompts and the corresponding images sampled in one step from the standard Gaussian (left) and from the predicted noise distributions (right).