Domain-Agnostic Tuning-Encoder for Fast Personalization of Text-To-Image Models

Moab Arar, Rinon Gal, Yuval Atzmon, Gal Chechik, Daniel Cohen-Or, Ariel Shamir, Amit H. Bermano

TL;DR

This work addresses the challenge of personalizing text-to-image diffusion models across diverse concepts without relying on domain-specific datasets. It introduces a domain-agnostic tuning-encoder that combines a nearest-neighbor contrastive embedding regularization with a HyperNetwork that predicts low-rank, LoRA-style weight modulations. This enables inference-time tuning in as few as $12$ steps and reduces memory usage. The approach achieves high-quality personalization across multiple domains, matching or surpassing state-of-the-art encoders and optimization-based methods while requiring only a single example. In practice, this makes rapid, personalized image synthesis accessible on more modest hardware and accelerates creative workflows.

Abstract

Text-to-image (T2I) personalization allows users to guide the creative image generation process by combining their own visual concepts in natural language prompts. Recently, encoder-based techniques have emerged as a new effective approach for T2I personalization, reducing the need for multiple images and long training times. However, most existing encoders are limited to a single-class domain, which hinders their ability to handle diverse concepts. In this work, we propose a domain-agnostic method that does not require any specialized dataset or prior information about the personalized concepts. We introduce a novel contrastive-based regularization technique to maintain high fidelity to the target concept characteristics while keeping the predicted embeddings close to editable regions of the latent space, by pushing the predicted tokens toward their nearest existing CLIP tokens. Our experimental results demonstrate the effectiveness of our approach and show how the learned tokens are more semantic than tokens predicted by unregularized models. This leads to a better representation that achieves state-of-the-art performance while being more flexible than previous methods.
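
To make the regularization idea concrete, the following is a minimal sketch of a nearest-neighbor contrastive loss, not the authors' implementation. It assumes an InfoNCE-style form in which each predicted embedding is pulled toward its most similar vocabulary token (the positive) and pushed away from the embeddings predicted for other concepts in the same batch (the negatives). The function names, the toy two-dimensional vocabulary, and the `temperature` value are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nearest_token(embedding, vocab):
    """Return the (word, vector) pair in vocab most similar to embedding."""
    return max(vocab.items(), key=lambda item: cosine(embedding, item[1]))


def nn_contrastive_loss(predicted, vocab, temperature=0.1):
    """Assumed InfoNCE-style loss, averaged over the batch.

    Positive: the nearest vocabulary token of each predicted embedding.
    Negatives: the embeddings predicted for the other concepts.
    """
    losses = []
    for i, emb in enumerate(predicted):
        _, positive = nearest_token(emb, vocab)
        pos = math.exp(cosine(emb, positive) / temperature)
        neg = sum(
            math.exp(cosine(emb, other) / temperature)
            for j, other in enumerate(predicted)
            if j != i
        )
        losses.append(-math.log(pos / (pos + neg)))
    return sum(losses) / len(losses)


# Toy vocabulary standing in for CLIP token embeddings (hypothetical values).
vocab = {"cat": [1.0, 0.0], "dog": [0.0, 1.0]}
separated = nn_contrastive_loss([[1.0, 0.0], [0.0, 1.0]], vocab)
collapsed = nn_contrastive_loss([[1.0, 0.0], [1.0, 0.0]], vocab)
```

In this toy setting the loss is small when each concept sits on a distinct real word and larger when two concepts collapse onto the same embedding, which is the behavior the regularization is meant to encourage.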

Paper Structure

This paper contains 25 sections, 5 equations, and 5 figures.

Figures (5)

  • Figure 1: Method overview. (top) Our method consists of a feature-extraction backbone which follows the E4T approach and uses a mix of CLIP-features from the concept image, and denoiser-based features from the current noisy generation. These features are fed into an embedding prediction head, and a hypernetwork which predicts LoRA-style attention-weight offsets. (bottom, right) Our embeddings are regularized by using a nearest-neighbour based contrastive loss that pushes them towards real words, but away from the embeddings of other concepts. (bottom, left) We employ a dual-path adaptation approach where each attention branch is repeated twice, once using the soft-embedding and the hypernetwork offsets, and once with the vanilla model and a hard-prompt containing the embedding's nearest neighbor. These branches are linearly blended to better preserve the prior.
  • Figure 2: Qualitative comparison with existing methods. Our method achieves comparable quality to the state-of-the-art using only a single image and $12$ or fewer training steps. Notably, it generalizes to unique objects which recent encoder-based methods struggle with.
  • Figure 3: The effects of removing or changing the embedding regularization. Removal of regularization leads to overfitting or mode collapse with poor quality results. Naïve regularizations tend to struggle with preserving the concept details. Our contrastive-based regularization can achieve a tradeoff between the two.
  • Figure 4: Quantitative evaluation results. (a) Comparisons to prior work. Our method presents an appealing point on the identity-prompt similarity trade-off curve, while being orders of magnitude quicker than optimization-based methods. (b) Ablation study results. Removing regularization typically leads to quick overfitting, where editability suffers. Skipping the fine-tuning step harms identity preservation, in line with E4T (Gal et al., 2023).
  • Figure 5: Additional qualitative results generated using our method. The left-most column shows the input image, followed by 4 personalized generations for each subject.
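
The two mechanisms named in the Figure 1 caption, LoRA-style weight offsets and the linear blending of the two attention branches, can be illustrated with a small sketch. This is an assumption-laden toy rather than the paper's code: it takes the offset to be a low-rank product added to a frozen weight matrix, and the blend to be a convex combination controlled by a hypothetical coefficient `alpha`. All function and parameter names are illustrative.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_offset_weights(weight, down, up, scale=1.0):
    """Return weight + scale * (up @ down).

    weight: d_out x d_in frozen attention projection.
    up:     d_out x r and down: r x d_in, the low-rank factors that a
            hypernetwork would predict (here supplied by hand).
    """
    delta = matmul(up, down)
    return [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(weight, delta)
    ]


def blend_paths(soft_out, hard_out, alpha):
    """Linearly blend the two branch outputs.

    soft_out: branch using the soft embedding and the weight offsets.
    hard_out: branch using the vanilla model and the nearest-neighbor word.
    alpha:    hypothetical blending coefficient in [0, 1].
    """
    return [alpha * s + (1.0 - alpha) * h for s, h in zip(soft_out, hard_out)]


# Rank-1 offset applied to a 2x2 identity projection.
identity = [[1.0, 0.0], [0.0, 1.0]]
adapted = lora_offset_weights(identity, down=[[3.0, 4.0]], up=[[1.0], [2.0]])
blended = blend_paths([1.0, 2.0], [3.0, 4.0], alpha=0.25)
```

A lower `alpha` leans on the vanilla branch and so on the model's prior, while a higher value favors the personalized branch. How the method actually sets this balance is not stated in the captions above.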