PALP: Prompt Aligned Personalization of Text-to-Image Models

Moab Arar, Andrey Voynov, Amir Hertz, Omri Avrahami, Shlomi Fruchter, Yael Pritch, Daniel Cohen-Or, Ariel Shamir

TL;DR

PALP tackles the trade-off between subject fidelity and prompt alignment in text-to-image personalization by training along two paths at once: a personalization path that learns the subject and a prompt-alignment path driven by score guidance. Its prompt-alignment term, a Delta Denoising Score variant of score distillation sampling, steers denoising toward a single target prompt while preserving the personalized subject, mitigating overfitting and mode collapse. The method supports single-shot and multi-subject scenarios and demonstrates strong text alignment with complex prompts, often outperforming existing personalization baselines in both qualitative and CLIP-based quantitative measures. In practice, PALP enables richer scene composition, including art-inspired prompts and cross-subject compositions; prompt-specific adapters are a possible route to instant per-prompt personalization.

Abstract

Content creators often aim to create personalized images using personal subjects that go beyond the capabilities of conventional text-to-image models. Additionally, they may want the resulting image to encompass a specific location, style, ambiance, and more. Existing personalization methods may compromise personalization ability or the alignment to complex textual prompts. This trade-off can impede the fulfillment of user prompts and subject fidelity. We propose a new approach focusing on personalization methods for a single prompt to address this issue. We term our approach prompt-aligned personalization. While this may seem restrictive, our method excels in improving text alignment, enabling the creation of images with complex and intricate prompts, which may pose a challenge for current techniques. In particular, our method keeps the personalized model aligned with a target prompt using an additional score distillation sampling term. We demonstrate the versatility of our method in multi- and single-shot settings and further show that it can compose multiple subjects or use inspiration from reference images, such as artworks. We compare our approach quantitatively and qualitatively with existing baselines and state-of-the-art techniques.

Paper Structure (33 sections, 7 equations, 12 figures, 9 tables)

Figures (12)

  • Figure 1: Prompt-aligned personalization allows rich and complex scene generation, including all elements of a condition prompt (right).
  • Figure 2: Previous personalization methods struggle with complex prompts (e.g., "A sketch inspired by Vitruvian man"), presenting a trade-off between prompt alignment and subject fidelity. Our method optimizes for both without compromising either.
  • Figure 3: PALP for multi-subject personalization achieves coherent and prompt-aligned results. Our method works when the subject has only one image (e.g., the "Wanderer above the Sea of Fog" artwork by Caspar David Friedrich).
  • Figure 4: Method overview. We propose a framework consisting of a personalization path (left) and a prompt-alignment branch (right) applied simultaneously in the same training step. We achieve personalization by finetuning the pre-trained model using a simple reconstruction loss to denoise the new subject $S$. To keep the model aligned with the target prompt, we additionally use score sampling to pivot the prediction towards the direction of the target prompt $y$, e.g., "A sketch of a cat." In this example, when personalization and text alignment are optimized simultaneously, the network learns to denoise the subject towards a "sketch" like representation. Finally, our method does not induce a significant memory overhead due to the efficient estimation of the score function, following SDS.
  • Figure 5: Visualization of $\hat{x}_0$. We visualize the model's estimate of $\hat{x}_0$ given pure noise and the prompt "A sketch of [V]." The base model (b) is not personalized to the target subject and predicts mainly the "sketch" appearance. Personalization methods (c) tend to overfit the input image: many image elements, including the background and the subject colors, are restored, suggesting the model does not consider the prompt condition. Prompt-aligned personalization (d) maintains the sketchiness and does not overfit (see the cat-like shape).
  • ...and 7 more figures
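
The two branches described in the Figure 4 caption can be illustrated with a small numerical sketch. This is not the authors' implementation: the denoisers are replaced by a hypothetical stand-in (`toy_predictor`), and the names `add_noise`, `reconstruction_loss`, `score_residual`, the `weight` factor, and the noise level `alpha_bar` are assumptions made for illustration. The sketch only shows the shape of the two signals: a reconstruction loss on the noise prediction for the subject, and an SDS-style residual that compares a frozen model's prediction under the target prompt with a reference prediction.

```python
import numpy as np

def add_noise(x0, eps, alpha_bar):
    """Forward diffusion: x_t = sqrt(alpha_bar) * x0 + sqrt(1 - alpha_bar) * eps."""
    return np.sqrt(alpha_bar) * x0 + np.sqrt(1.0 - alpha_bar) * eps

def reconstruction_loss(eps_pred, eps):
    """Personalization path: mean squared error between predicted and sampled noise."""
    return float(np.mean((eps_pred - eps) ** 2))

def score_residual(eps_target, eps_reference, weight=1.0):
    """Prompt-alignment branch: weighted residual used as an SDS-style gradient signal.

    eps_target    -- prediction of the frozen pre-trained model under the target prompt
    eps_reference -- the sampled noise (plain SDS) or another prediction (delta-style);
                     which reference to use is an assumption of this sketch
    """
    return weight * (eps_target - eps_reference)

def toy_predictor(scale):
    """Hypothetical stand-in for a denoiser: predicts scale * x_t."""
    return lambda x_t: scale * x_t

rng = np.random.default_rng(0)
x0 = rng.normal(size=(4, 4))    # stand-in for a subject image (or latent)
eps = rng.normal(size=(4, 4))   # sampled Gaussian noise
x_t = add_noise(x0, eps, alpha_bar=0.5)

personalized = toy_predictor(0.9)  # model being fine-tuned (toy)
frozen = toy_predictor(1.1)        # frozen pre-trained model under the target prompt (toy)

# Both signals are computed in the same training step.
loss_rec = reconstruction_loss(personalized(x_t), eps)
residual = score_residual(frozen(x_t), personalized(x_t), weight=0.5)
```

In a real training loop the residual would be treated as a constant and back-propagated through the personalized model's output, as in SDS, so no gradient flows through the frozen model; the toy predictors here have no parameters and only show how the two quantities are formed.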