Object-Driven One-Shot Fine-tuning of Text-to-Image Diffusion with Prototypical Embedding

Jianxiang Lu, Cong Xie, Hui Guo

TL;DR

This work tackles the challenge of one-shot fine-tuning for inserting user-specified objects into text-to-image diffusion outputs while preserving object identity and enabling diverse contexts. It introduces an object-driven framework that initializes a prototypical embedding from multimodal cues (image, mask, and class text) and employs a class-characterizing regularization along with an object-specific loss to balance fidelity and generalization. Implemented on a Stable Diffusion backbone with LoRA, the approach supports single and multi-object implantation and demonstrates superior fidelity-generalization trade-offs against baselines like DreamBooth, Textual Inversion, and LoRA. The method yields high-quality, controllable, and scalable personalization for content creation, though it faces mask-edge and tiny-object fidelity challenges that inform future improvements such as multi-scale perception and improved object masks.

Abstract

Large-scale text-to-image generation models have made remarkable progress, and many fine-tuning methods have been proposed. However, these models often struggle with novel objects, especially in one-shot scenarios. Our method addresses the challenges of generalizability and fidelity in an object-driven way, using only a single input image and the object-specific regions of interest. To improve generalizability and mitigate overfitting, a prototypical embedding is initialized from the object's appearance and its class before the diffusion model is fine-tuned. During fine-tuning, we propose a class-characterizing regularization to preserve prior knowledge of object classes. To further improve fidelity, we introduce an object-specific loss, which can also be used to implant multiple objects. Overall, our object-driven method implants new objects that integrate seamlessly with existing concepts while achieving high fidelity and generalization. Our method outperforms several existing works. The code will be released.
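
The abstract does not give the form of the object-specific loss. As a minimal sketch of the general idea, assuming the loss restricts the usual noise-prediction error to each object's region of interest, the following computes a mask-weighted mean squared error and sums it over objects. The function names `masked_mse` and `object_specific_loss`, and the choice to sum per-object terms, are illustrative assumptions and not the authors' implementation.

```python
from typing import Sequence

Grid = Sequence[Sequence[float]]


def masked_mse(pred: Grid, target: Grid, mask: Grid) -> float:
    """Squared error averaged only over positions where the mask is non-zero.

    Assumption: `mask` holds weights in [0, 1] marking one object's region.
    """
    total = 0.0
    weight = 0.0
    for p_row, t_row, m_row in zip(pred, target, mask):
        for p, t, m in zip(p_row, t_row, m_row):
            total += m * (p - t) ** 2
            weight += m
    if weight == 0:
        raise ValueError("mask selects no positions")
    return total / weight


def object_specific_loss(pred: Grid, target: Grid, masks: Sequence[Grid]) -> float:
    """Hypothetical multi-object variant: one masked term per object, summed."""
    return sum(masked_mse(pred, target, m) for m in masks)


# Toy 2x2 example with two single-pixel objects.
pred = [[1.0, 0.0], [0.0, 0.0]]
target = [[0.0, 0.0], [0.0, 2.0]]
mask_a = [[1.0, 0.0], [0.0, 0.0]]
mask_b = [[0.0, 0.0], [0.0, 1.0]]
loss = object_specific_loss(pred, target, [mask_a, mask_b])
```

Normalizing by the mask area keeps a small object from contributing a vanishing gradient relative to a large one, which is one plausible reason to prefer a per-object average over a plain masked sum.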

Paper Structure (20 sections, 5 equations, 8 figures, 1 table)

Figures (8)

  • Figure 1: Methodology overview. Our method takes an input image, its corresponding masks, and the relevant class names as input, and generates object-specific text embeddings and personalized LoRA weights. During inference, the text embeddings and LoRA weights are combined with other features to generate a wide range of variations of the object.
  • Figure 2: Fine-tuning details. Given one image with a single object or multiple objects, our method fine-tunes a text-to-image diffusion model. Taking a single object as an example, our method uses a prototypical embedding for initialization and employs class-characterizing regularization to enhance generation diversity, along with an object-specific loss to ensure the fidelity of the synthesized images.
  • Figure 3: Qualitative comparison. For one-shot tasks, existing methods struggle to achieve both fidelity and generalizability for a given text prompt. Our method generates images that better match the reference image and remain consistent with the text semantics across multiple prompts. Note that the * symbol represents a unique identifier.
  • Figure 4: Prototypical embedding initialization. Our proposed method, utilizing prototypical embedding as the initialization, ensures the generation of images that are more contextually relevant.
  • Figure 5: Quantitative assessment. We visualize the metrics for each method; the closer a point lies to the lower right, the better the method performs.
  • ...and 3 more figures
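
The captions for Figures 2 and 4 describe initializing a prototypical embedding from the object's appearance and its class, but the fusion rule is not stated in this summary. The sketch below shows one simple possibility under stated assumptions: both cues are already available as vectors of equal dimension (for example, from an image encoder applied to the masked object and a text encoder applied to the class name), and they are combined by a normalized convex mix. The function `init_prototype` and the mixing weight `alpha` are hypothetical and should not be read as the paper's procedure.

```python
import math
from typing import List, Sequence


def l2_normalize(v: Sequence[float]) -> List[float]:
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]


def init_prototype(image_emb: Sequence[float],
                   class_emb: Sequence[float],
                   alpha: float = 0.5) -> List[float]:
    """Hypothetical initialization: mix appearance and class cues.

    alpha = 1.0 uses only the appearance embedding; alpha = 0.0 uses only
    the class embedding. Both inputs are normalized first so that neither
    dominates purely because of its scale.
    """
    if len(image_emb) != len(class_emb):
        raise ValueError("embeddings must have the same dimension")
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    img = l2_normalize(image_emb)
    cls = l2_normalize(class_emb)
    mixed = [alpha * i + (1.0 - alpha) * c for i, c in zip(img, cls)]
    return l2_normalize(mixed)


# Toy 2-D example: orthogonal appearance and class directions.
proto = init_prototype([2.0, 0.0], [0.0, 3.0])
```

Starting the learned token between the class embedding and the object's appearance, instead of at a random point, is consistent with the stated goal of mitigating overfitting while keeping the class prior, though the actual mechanism may differ.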