Key-Locked Rank One Editing for Text-to-Image Personalization

Yoad Tewel, Rinon Gal, Gal Chechik, Yuval Atzmon

TL;DR

Perfusion introduces Key-Locked Rank One Editing for text-to-image personalization, addressing overfitting, fidelity, and cross-concept composition. By locking cross-attention Keys to super-categories and applying gated rank-1 edits to Keys and Values, it achieves high object fidelity with a 100KB-per-concept footprint and supports runtime combination of multiple concepts. The method aligns training and inference with an end-to-end, gated ROME-inspired update and enables flexible control over visual-textual balance without retraining. Empirically, Perfusion outperforms baselines on qualitative and quantitative measures and reveals novel interactions in personalized scenes, including one-shot results.

Abstract

Text-to-image (T2I) models offer a new level of flexibility by allowing users to guide the creative process through natural language. However, personalizing these models to align with user-provided visual concepts remains a challenging problem. The task of T2I personalization poses multiple hard challenges, such as maintaining high visual fidelity while allowing creative control, combining multiple personalized concepts in a single image, and keeping a small model size. We present Perfusion, a T2I personalization method that addresses these challenges using dynamic rank-1 updates to the underlying T2I model. Perfusion avoids overfitting by introducing a new mechanism that "locks" new concepts' cross-attention Keys to their superordinate category. Additionally, we develop a gated rank-1 approach that enables us to control the influence of a learned concept at inference time and to combine multiple concepts. This allows runtime-efficient balancing of visual fidelity and textual alignment with a single 100KB trained model, which is five orders of magnitude smaller than the current state of the art. Moreover, it can span different operating points across the Pareto front without additional training. Finally, we show that Perfusion outperforms strong baselines in both qualitative and quantitative terms. Importantly, key-locking leads to novel results compared to traditional approaches, making it possible to portray personalized object interactions in unprecedented ways, even in one-shot settings.

Paper Structure

This paper contains 34 sections, 17 equations, 22 figures, and 1 algorithm.

Figures (22)

  • Figure 1: Attention overfit: Typical overfit in Textual-Inversion (TI), caused by the attention of the learned embedding taking over the whole image. Here we visualize the attention maps that correspond to the "dog*" word. The TI attention regions (right panel) are spread across the entire image rather than focusing on the object. This leads the generative process to ignore the rest of the prompt and depict only the "dog*" concept.
  • Figure 2: Architecture outline (A): A prompt is transformed into a sequence of encodings. Each encoding is fed to a set of cross-attention modules (purple blocks) of a diffusion U-Net denoiser. The zoomed-in purple module shows how the Key and Value pathways are conditioned on the text encoding. The Key drives the attention map, which then modulates the Value pathway. Gated Rank-1 Edit (B): Top: The K pathway is locked, so any encoding of $e_\text{Hugsy}$ that reaches $\hat{W}_k$ is mapped to the key of the super-category, $K^\text{teddy}$. Bottom: Any encoding of $e_\text{Hugsy}$ that reaches $\hat{W}_v$ is mapped to $V^\text{Hugsy}$, which is learned. The gated aspect of this update allows it to be applied selectively, only to the necessary encodings, and provides a means of regulating the strength of the learned concept as expressed in the output images.
  • Figure 3: Generation results with single concept examples. For each concept, we show exemplars from our training set, along with generated images, their conditioning texts and comparisons to Custom-Diffusion (CD) and Dreambooth (DB) baselines. Perfusion can enable more animate results, with better prompt-matching and less susceptibility to background traits from the original image. Note in particular the improved garments and theatrics on our cat (top), or the prompt-appropriate gaze and posture when instructing our dog to read a book (bottom). For some prompts, the baselines simply copy the content from the training set (e.g. the pot).
  • Figure 4: Additional generation results with multi-concept examples. We show pairs of concepts interacting, and compare to CD. Except for the teddy* prompt, all prompts are from the CD paper and use the images provided by that paper. In the teddy* example, Perfusion portrays it with the sunglasses*, while CD omits the sunglasses*. In the watercolor painting, Perfusion better preserves the chair shape. In the table* example, Perfusion better preserves the table color.
  • Figure 5: Visual - Textual Similarity Plane: With just a single 100KB trained model and run-time parameter choices, Perfusion (blue and cyan) spans the Pareto front. Error bars denote 95% confidence intervals.
  • ...and 17 more figures
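
The gated rank-1 edit and key-locking described in the Figure 2 caption can be illustrated with a small numerical sketch. This is not the authors' implementation: the function name `gated_rank1_output`, the cosine-similarity gate, the `strength` knob, and the values of `beta` and `tau` are all assumptions chosen for illustration, and the paper's exact gate and similarity measure may differ. The sketch only shows the general idea: an encoding close to the concept is redirected to a target output, the Key target is fixed to the super-category's Key, and unrelated encodings pass through the frozen projection almost unchanged.

```python
import numpy as np


def gated_rank1_output(W, e, i_star, o_star, beta=0.75, tau=0.02, strength=1.0):
    """Project encoding `e` through W with a gated rank-1 style edit.

    i_star: encoding of the personalized concept token (e.g. "Hugsy").
    o_star: output that i_star should be mapped to.
    beta, tau: gate threshold and temperature (illustrative values).
    strength: inference-time scale on the edit (hypothetical knob).
    """
    # ASSUMPTION: cosine similarity is used as the gate's similarity measure.
    sim = (e @ i_star) / (np.linalg.norm(e) * np.linalg.norm(i_star))
    gate = 1.0 / (1.0 + np.exp(-(sim - beta) / tau))
    # The correction always points along one direction, (o_star - W @ i_star),
    # so the edit moves the output of i_star from W @ i_star toward o_star.
    return W @ e + strength * gate * (o_star - W @ i_star)


rng = np.random.default_rng(0)
d_text, d_attn = 8, 4
W_k = rng.normal(size=(d_attn, d_text))  # frozen Key projection
W_v = rng.normal(size=(d_attn, d_text))  # frozen Value projection
e_hugsy = rng.normal(size=d_text)        # concept encoding
e_teddy = rng.normal(size=d_text)        # super-category encoding

# Key locking: the concept's Key target is the super-category's Key, held fixed.
k_target = W_k @ e_teddy
# The Value target would be learned; a random stand-in is used here.
v_target = rng.normal(size=d_attn)

k_hugsy = gated_rank1_output(W_k, e_hugsy, e_hugsy, k_target)
v_hugsy = gated_rank1_output(W_v, e_hugsy, e_hugsy, v_target)

# An encoding orthogonal to the concept passes through almost unchanged.
e_other = rng.normal(size=d_text)
e_other -= (e_other @ e_hugsy) / (e_hugsy @ e_hugsy) * e_hugsy
k_other = gated_rank1_output(W_k, e_other, e_hugsy, k_target)
```

In this sketch, lowering `strength` (or raising `beta`) weakens the concept's influence without any retraining, which mirrors the run-time control over the learned concept that the caption describes.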