ViPer: Visual Personalization of Generative Models via Individual Preference Learning

Sogand Salehi, Mahdi Shafiei, Teresa Yeo, Roman Bachmann, Amir Zamir

TL;DR

ViPer tackles personalized image generation by learning an individual's visual preferences from free-form comments in a one-time process. It converts comments into structured visual attributes via a Vision Preference Extractor and conditions Stable Diffusion with VP embeddings through the prompt update $p = E(p) + \beta (E(VP_+) - E(VP_-))$, where $\beta \le 1$, optionally leveraging classifier-free guidance. A proxy evaluator $M$ predicts whether a given image would be liked, enabling scalable evaluation beyond costly human studies. Empirical results show ViPer outperforms baselines in aligning generations with individual tastes and offers controllable personalization strength, making it practical for broad deployment. The approach advances personalized generative AI by obviating iterative prompt engineering and fine-tuning while preserving adherence to user prompts and diverse visual preferences.
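The prompt update above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function name `personalize_prompt_embedding`, the random stand-in embeddings, the `(77, 768)` shape, and the lower bound of 0 on $\beta$ are all assumptions made for illustration. In practice $E(\cdot)$ would be the text encoder of the diffusion model.

```python
import numpy as np


def personalize_prompt_embedding(e_prompt, e_liked, e_disliked, beta=0.5):
    """Return E(p) + beta * (E(VP+) - E(VP-)).

    e_prompt, e_liked, e_disliked: arrays of identical shape holding the
    encoded prompt, liked attributes, and disliked attributes.
    beta: personalization strength. The summary states beta <= 1; the
    lower bound of 0 is an assumption of this sketch.
    """
    if not 0.0 <= beta <= 1.0:
        raise ValueError("beta is expected to lie in [0, 1]")
    return e_prompt + beta * (e_liked - e_disliked)


# Random stand-ins for encoder outputs (hypothetical shape: tokens x dim).
rng = np.random.default_rng(0)
shape = (77, 768)
e_prompt = rng.normal(size=shape)
e_liked = rng.normal(size=shape)
e_disliked = rng.normal(size=shape)

e_personalized = personalize_prompt_embedding(
    e_prompt, e_liked, e_disliked, beta=0.5
)
e_unpersonalized = personalize_prompt_embedding(
    e_prompt, e_liked, e_disliked, beta=0.0
)
```

With `beta=0.0` the function returns the original prompt embedding unchanged, which matches the description of $\beta$ as a control over personalization strength.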

Abstract

Different users find different images generated for the same prompt desirable. This gives rise to personalized image generation which involves creating images aligned with an individual's visual preference. Current generative models are, however, unpersonalized, as they are tuned to produce outputs that appeal to a broad audience. Using them to generate images aligned with individual users relies on iterative manual prompt engineering by the user which is inefficient and undesirable. We propose to personalize the image generation process by first capturing the generic preferences of the user in a one-time process by inviting them to comment on a small selection of images, explaining why they like or dislike each. Based on these comments, we infer a user's structured liked and disliked visual attributes, i.e., their visual preference, using a large language model. These attributes are used to guide a text-to-image model toward producing images that are tuned towards the individual user's visual preference. Through a series of user studies and large language model guided evaluations, we demonstrate that the proposed method results in generations that are well aligned with individual users' visual preferences.

Paper Structure (27 sections, 2 equations, 22 figures, 6 tables, 1 algorithm)

Figures (22)

  • Figure 1: We introduce ViPer, a method that personalizes the output of generative models to align with different users' preferences for the same prompt. This is done via a one-time capture of the user's general preferences and conditioning the generative model on them without the need for engineered detailed prompts. Notice how the results vary for the same prompt for different users based on their visual preferences.
  • Figure 2: (Left) Capturing an individual's preference from comments. We ask interested individuals to comment on a small set of images. These images are generated such that they have diverse styles (see \ref{sec:commentimages} for visuals and further details). The comments do not need to follow a specific structure, so users can write as much or as little as they like, although more detailed and expressive comments lead to better personalization. We show an example of an individual's comments on two images. Our method allows for such free-form comments because we use a language model, i.e., a visual preference extractor ($VPE$), to extract structured preferences. To obtain the $VPE$, we fine-tune IDEFICS2-8b. Converting the user's free-form comments into structured visual preferences is a one-time process that provides a concise representation of the user's preferences. In this example, the $VPE$ extracts the individual's preferences for the color palette, vibe, and lighting from the comments and translates them into attributes like "Rich and Muted Color Palettes", "Gloomy", and "Highlights and Shadows". (Right) Conditioning Stable Diffusion on an individual's visual preference. The user's visual preferences are then encoded and added to the prompt embedding (see \ref{eq:embeddingaddtion}). This allows us to steer the generations towards certain styles. The preferences are also used directly in the guidance formula (see \ref{eq:guidance}) to guide Stable Diffusion's results toward the user's preferences. This step does not require any fine-tuning and can be used directly with the Stable Diffusion model. Note how the generations reflect the user's preference by generating a "Gloomy" style while avoiding styles like "Contemporary Abstract", "Vibrant Colors", and "Naive Art".
  • Figure 3: Learning a proxy measure for evaluating personalized generations. We fine-tune IDEFICS2-8b, denoted by $M$, using both the personalized/liked and non-personalized/disliked set of images, $\mathcal{X}$, of an agent. $M$ is given the set of images $\mathcal{X}$ from an individual for context and asked to predict if the user will like a given query image. We trained the model with cross-entropy loss.
  • Figure 4: Comparing the personalized generations between users. Each row shows images generated from the same prompt shown on the left. The first column displays generations without any personalization, while the next four columns show personalized generations for users with distinct preferences. Note that the generations are consistent with the input prompts, and each user's preferences are reflected across prompts. Moreover, while the color palette is a dominant visual attribute and can be noticed at first glance, other visual attributes such as brush strokes, lines, and vibe also have intricate effects (see the last row, where colors are specified through the input prompt). Users' visual preferences are listed in \ref{tb:VPS}.
  • Figure 5: Controlling the degree of personalization. Each row displays images generated from the same prompt. The left column is generated without any personalization, i.e., $\beta = 0$ in the guidance equation in \ref{eq:guidance}, while each of the next five columns increases $\beta$ by 0.2, up to $\beta = 1$. This user's preference for soft, simple, dreamy vibes and pastel colors becomes more pronounced as $\beta$ increases.
  • ...and 17 more figures
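
The $\beta$ sweep described in the Figure 5 caption can be sketched as follows. This is an illustrative example under stated assumptions, not the authors' code: the caption ties $\beta$ to the guidance equation, which is not reproduced in this summary, so the sketch instead measures the size of the embedding shift $\beta (E(VP_+) - E(VP_-))$ from the TL;DR. The helper names `beta_schedule` and `shift_norm` and the two-dimensional toy embeddings are hypothetical.

```python
import numpy as np


def beta_schedule(start=0.0, step=0.2, count=6):
    """Personalization strengths for one row of the sweep.

    Defaults follow the caption: one unpersonalized column (beta = 0)
    followed by five columns, each 0.2 higher than the previous one.
    Rounding removes floating-point noise such as 0.6000000000000001.
    """
    return [round(start + i * step, 10) for i in range(count)]


def shift_norm(e_liked, e_disliked, beta):
    """Euclidean norm of the shift beta * (E(VP+) - E(VP-))."""
    return float(np.linalg.norm(beta * (e_liked - e_disliked)))


# Toy stand-ins for encoded liked / disliked attributes (hypothetical).
e_liked = np.array([1.0, 0.0])
e_disliked = np.array([0.0, 1.0])

betas = beta_schedule()
norms = [shift_norm(e_liked, e_disliked, b) for b in betas]

for b, n in zip(betas, norms):
    print(f"beta={b:.1f}  shift norm={n:.3f}")
```

Under this assumption the shift grows linearly with $\beta$, from no change at $\beta = 0$ to the full difference between liked and disliked embeddings at $\beta = 1$.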