Zero-Shot Personalization of Objects via Textual Inversion

Aniket Roy, Maitreya Suin, Rama Chellappa

Abstract

Recent advances in text-to-image diffusion models have substantially improved the quality of image customization, enabling the synthesis of highly realistic images. Despite this progress, achieving fast and efficient personalization remains a key challenge, particularly for real-world applications. Existing approaches primarily accelerate customization for human subjects by injecting identity-specific embeddings into diffusion models, but these strategies do not generalize well to arbitrary object categories, limiting their applicability. To address this limitation, we propose a novel framework that employs a learned network to predict object-specific textual inversion embeddings, which are subsequently integrated into the UNet timesteps of a diffusion model for text-conditional customization. This design enables rapid, zero-shot personalization of a wide range of objects in a single forward pass, offering both flexibility and scalability. Extensive experiments across multiple tasks and settings demonstrate the effectiveness of our approach, highlighting its potential to support fast, versatile, and inclusive image customization. To the best of our knowledge, this work represents the first attempt to achieve such general-purpose, training-free personalization within diffusion models, paving the way for future research in personalized image generation.

Paper Structure (20 sections, 5 equations, 12 figures, 5 tables)

Figures (12)

  • Figure 1: Using only a single input image and a concept extraction network, our method is able to personalize a text-to-image diffusion model much faster than test-time optimization-based approaches, while maintaining the subject’s uniqueness and details.
  • Figure 2: An overview of our proposed approach. We first obtain the 'ground-truth' textual embeddings of the concepts in the training set using test-time optimization-based methods such as Textual Inversion. Next, we train our Concept-Extraction network to produce that embedding given a single image and a text template. Once it is trained, we further fine-tune the cross-attention layers of the diffusion model using the modified textual embeddings obtained from the frozen Concept-Extraction network. During inference, the Concept-Extraction network and the fine-tuned diffusion model together generate variations of the subject in a given image, without requiring expensive optimization steps.
  • Figure 3: t-SNE plot of textual inversion embeddings. We observe that the textual inversion embeddings of images of the same identity cluster together.
  • Figure 4: Comparison with existing methods (Dreambooth (DB), Custom diffusion (CD), Textual inversion (TI)) on the Custom101 dataset.
  • Figure 5: Qualitative comparison with existing methods on the Custom101 dataset.
  • ...and 7 more figures
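
The inference step described in the Figure 2 caption can be illustrated with a small sketch: a network predicts a concept embedding from a single image, and that vector takes the place of a placeholder token in the text template before the prompt conditions the diffusion model. The snippet below is a toy illustration under stated assumptions, not the authors' implementation. The placeholder string `<concept>`, the two-dimensional vocabulary, the single linear layer standing in for the Concept-Extraction network, and all function names are hypothetical.

```python
from typing import Dict, List

Vector = List[float]

# Hypothetical placeholder token that marks where the concept goes in the template.
PLACEHOLDER = "<concept>"


def concept_extraction_stub(image_features: Vector, weights: List[Vector]) -> Vector:
    """Stand-in for a trained concept extraction network.

    Assumption: a single linear layer mapping image features to one
    text-embedding vector. The real network is not specified here.
    """
    return [sum(w * x for w, x in zip(row, image_features)) for row in weights]


def embed_prompt(prompt: str, vocab: Dict[str, Vector], concept: Vector) -> List[Vector]:
    """Embed a whitespace-tokenized prompt, substituting the predicted
    concept vector wherever the placeholder token appears."""
    embedded = []
    for token in prompt.split():
        if token == PLACEHOLDER:
            embedded.append(list(concept))
        else:
            embedded.append(list(vocab[token]))
    return embedded


# Toy data (all values are made up for illustration).
vocab = {"a": [1.0, 0.0], "photo": [0.0, 1.0], "of": [0.5, 0.5]}
weights = [[1.0, 0.0, 0.0], [0.0, 2.0, 0.0]]
image_features = [0.2, 0.3, 0.9]

# One forward pass produces the concept embedding; no per-image optimization.
concept = concept_extraction_stub(image_features, weights)

# The resulting sequence is what a text-conditioned model would consume.
conditioning = embed_prompt("a photo of <concept>", vocab, concept)
```

In this sketch the only per-image work is a single call to the stub network, which mirrors the contrast the captions draw with test-time optimization methods such as Textual Inversion, where the embedding is found by iterative optimization for each new concept.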