
LCM-Lookahead for Encoder-based Text-to-Image Personalization

Rinon Gal, Or Lichter, Elad Richardson, Or Patashnik, Amit H. Bermano, Gal Chechik, Daniel Cohen-Or

TL;DR

This work focuses on encoder-based personalization approaches, and demonstrates that by tuning them with a lookahead identity loss, they can achieve higher identity fidelity, without sacrificing layout diversity or prompt alignment.

Abstract

Recent advancements in diffusion models have introduced fast sampling methods that can effectively produce high-quality images in just one or a few denoising steps. Interestingly, when these are distilled from existing diffusion models, they often maintain alignment with the original model, retaining similar outputs for similar prompts and seeds. These properties present opportunities to leverage fast sampling methods as a shortcut-mechanism, using them to create a preview of denoised outputs through which we can backpropagate image-space losses. In this work, we explore the potential of using such shortcut-mechanisms to guide the personalization of text-to-image models to specific facial identities. We focus on encoder-based personalization approaches, and demonstrate that by tuning them with a lookahead identity loss, we can achieve higher identity fidelity, without sacrificing layout diversity or prompt alignment. We further explore the use of attention sharing mechanisms and consistent data generation for the task of personalization, and find that encoder training can benefit from both.
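For context, the "preview of denoised outputs" can be contrasted with the standard single-step clean-image estimate (the $\hat{x}_0$ approximation mentioned in the figure captions). The sketch below is not the paper's code. It only illustrates the textbook DDPM relation $x_t = \sqrt{\bar{\alpha}_t}\,x_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon$ and its inversion, using plain Python lists in place of image tensors. The function names `add_noise` and `predict_x0` and the parameter `alpha_bar_t` are hypothetical names chosen for illustration. In the approach described above, a distilled fast sampler would take the place of this simple inversion.

```python
import math


def add_noise(x0, eps, alpha_bar_t):
    """Forward diffusion: x_t = sqrt(a) * x0 + sqrt(1 - a) * eps.

    x0 and eps are flat lists of floats standing in for an image and
    its Gaussian noise; alpha_bar_t is the cumulative noise schedule
    value at timestep t (an illustrative parameter name).
    """
    if not 0.0 < alpha_bar_t <= 1.0:
        raise ValueError("alpha_bar_t must be in (0, 1]")
    a = math.sqrt(alpha_bar_t)
    b = math.sqrt(1.0 - alpha_bar_t)
    return [a * x + b * e for x, e in zip(x0, eps)]


def predict_x0(x_t, eps_pred, alpha_bar_t):
    """Single-step clean-image estimate from a noise prediction.

    Inverts the forward relation:
        x0_hat = (x_t - sqrt(1 - a) * eps_pred) / sqrt(a)
    The estimate is exact only when eps_pred equals the true noise;
    with a learned noise predictor it is an approximation that tends
    to degrade at early (high-noise) timesteps.
    """
    if not 0.0 < alpha_bar_t <= 1.0:
        raise ValueError("alpha_bar_t must be in (0, 1]")
    a = math.sqrt(alpha_bar_t)
    b = math.sqrt(1.0 - alpha_bar_t)
    return [(x - b * e) / a for x, e in zip(x_t, eps_pred)]


if __name__ == "__main__":
    clean = [0.5, -1.0, 2.0]
    noise = [0.1, 0.2, -0.3]
    noisy = add_noise(clean, noise, alpha_bar_t=0.25)
    # With the true noise as the "prediction", the clean signal is recovered.
    print(predict_x0(noisy, noise, alpha_bar_t=0.25))
```

An image-space loss computed on such an estimate can be backpropagated to whatever produced `eps_pred`. The lookahead idea applies the same principle, but to a preview produced by a fast sampler.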

Paper Structure (21 sections, 4 equations, 6 figures, 3 tables)

Figures (6)

  • Figure 1: LCM output alignment. We first denoise an image partway using DDPM \cite{ho2020denoising} sampling with a baseline SDXL model \cite{podell2024sdxl}. We then complete sampling in two ways: by performing a single LCM step, or by approximating the clean image using DDPM. Even at early timesteps, the LCM outputs provide a good approximation of the final DDPM prediction. This also holds for personalized models (e.g., a LoRA trained on the DreamBooth \cite{ruiz2022dreambooth} dog, bottom). The numbers below each column indicate the fraction of standard DDPM denoising steps completed before applying the single-step prediction.
  • Figure 2: (left) Encoder architecture: Our encoder has two branches. The first is the standard IP-Adapter \cite{ye2023ipadapter}, which provides conditions through a new cross-attention head. The second is a copy of the SDXL U-Net from which we extract self-attention keys and values, which we concatenate with those of the main denoising branch. (right) Training setup: The two encoder paths are provided with a conditioning image (and its noisy latent), and their outputs are used to condition the denoising of a different image of the same subject. We denoise the image with both the baseline SDXL model \cite{podell2024sdxl} and an LCM model \cite{luo2023lcmlora}. The baseline model's output is used for calculating the standard diffusion loss (\ref{eq:l_simple}). The LCM output is used to calculate the lookahead identity loss (\ref{eq:lh_loss}). We portray latents as images for visual clarity.
  • Figure 3: Consistent Data. Consistent data generated using SDXL-Turbo, with the description "old man with curly hair and a moustache" incorporated into different prompts (e.g., "as an oil painting", "as a wanted poster").
  • Figure 4: LCM-Lookahead Guidance. Results of classifier guidance when applying different classifiers on top of either our LCM-Lookahead or the standard $\hat{x}_0$ approximation. Each classifier preserves different attributes of the guiding image. $\hat{x}_0$ guidance may result in reduced quality or visible artifacts. Identity similarity values ($\uparrow$, measured using \cite{huang2020curricularface}) are shown at the bottom.
  • Figure 5: Qualitative results. Our method can personalize a model to specific face identities at inference time, and align with both photorealistic and stylized prompts.
  • ...and 1 more figures
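
The attention-sharing branch described in the Figure 2 caption (self-attention keys and values from a reference U-Net copy, concatenated with those of the main denoising branch) can be illustrated with a toy example. This is a minimal sketch under stated assumptions, not the authors' implementation. It assumes a single attention head, concatenation along the token axis, and plain Python lists in place of tensors. The function names `attention` and `shared_attention` are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention for a single head.

    queries: list of query vectors (one per token of the main branch)
    keys, values: lists of key/value vectors of equal length
    Returns one output vector per query.
    """
    scale = 1.0 / math.sqrt(len(queries[0]))
    outputs = []
    for q in queries:
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
        weights = softmax(scores)
        dim = len(values[0])
        outputs.append(
            [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
        )
    return outputs


def shared_attention(queries, keys, values, ref_keys, ref_values):
    """Attention where the main branch also attends to reference tokens.

    Assumption for illustration: the reference keys/values are appended
    along the token axis, so each query can draw on features from both
    the image being denoised and the conditioning image.
    """
    return attention(queries, keys + ref_keys, values + ref_values)


if __name__ == "__main__":
    q = [[10.0, 0.0]]
    k, v = [[0.0, 1.0]], [[1.0, 1.0]]
    ref_k, ref_v = [[10.0, 0.0]], [[5.0, -5.0]]
    # The query matches the reference key, so the output is pulled
    # toward the reference value.
    print(shared_attention(q, k, v, ref_k, ref_v))
```

Because the queries still come from the main branch, the output keeps one vector per main-branch token. Only the pool of keys and values that each token can attend to grows.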