IDAdapter: Learning Mixed Features for Tuning-Free Personalization of Text-to-Image Models

Siying Cui, Jia Guo, Xiang An, Jiankang Deng, Yongle Zhao, Xinyu Wei, Ziyong Feng

TL;DR

IDAdapter tackles the challenge of personalized text-to-image generation from a single face image without test-time fine-tuning. It introduces Mixed Facial Features (MFF) to fuse identity cues from multiple reference images and uses adapter-based visual injection plus textual injection to embed a personalized concept, guided by a face identity loss. The approach decouples identity from non-identity attributes to enable diversity in style, pose, and expression while preserving identity. Empirical results show strong identity fidelity and higher diversity than prior methods, with efficient training on a single GPU and tuning-free inference, enhancing practicality for personalized avatars.

Abstract

Leveraging Stable Diffusion for the generation of personalized portraits has emerged as a powerful and noteworthy tool, enabling users to create high-fidelity, custom character avatars based on their specific prompts. However, existing personalization methods face challenges, including test-time fine-tuning, the requirement of multiple input images, low preservation of identity, and limited diversity in generated outcomes. To overcome these challenges, we introduce IDAdapter, a tuning-free approach that enhances the diversity and identity preservation in personalized image generation from a single face image. IDAdapter integrates a personalized concept into the generation process through a combination of textual and visual injections and a face identity loss. During the training phase, we incorporate mixed features from multiple reference images of a specific identity to enrich identity-related content details, guiding the model to generate images with more diverse styles, expressions, and angles compared to previous works. Extensive evaluations demonstrate the effectiveness of our method, achieving both diversity and identity fidelity in generated images.

Paper Structure

This paper contains 17 sections, 9 equations, 16 figures, and 2 tables.

Figures (16)

  • Figure 1: Given a single facial photo as the reference and a text prompt, our proposed method can generate images in a variety of styles, angles, and expressions without any test-time fine-tuning at the inference stage. The results exhibit dressing-up modifications, viewpoint control, recontextualization, art renditions, property alteration, as well as emotion integration, while preserving high fidelity to the face.
  • Figure 2: Overview of IDAdapter training. In each optimization step, we randomly select $N$ different images of the same identity. We label the faces of all the reference images "[class noun]" (e.g. "woman" or "man"), and regard the text description and the reference images as a training pair. The features extracted from the reference images are then fused by a mixed facial features (MFF) module, which provides the model with rich, detailed identity information and possibilities for variation. At the inference stage, only a single image is required, which is replicated to form a set of $N$ reference images.
  • Figure 3: Binding non-identity (non-ID) information vs. decoupling ID and non-ID information. Most existing generation methods bind the identifier word to non-ID information and rarely exhibit changes in facial expressions, lighting, poses, etc. Our method decouples ID and non-ID information and can generate high-fidelity images with diverse styles, expressions, and angles (text prompt of the example: "man in the snow, happy").
  • Figure 4: Architecture of MFF. The MFF module consists of a learnable transformer, implemented with two attention blocks, that translates the identity feature $\mathbf{f}_a$ and the patch feature $\mathbf{f}_v$ into a latent MFF vision embedding $\mathbf{E}_r$, which is injected into the self-attention layers of the UNet through adapters.
  • Figure 5: Comparisons with several baseline methods. IDAdapter is stronger in the diversity of properties, poses, expressions and other non-ID appearance, achieving very strong editability while preserving identity.
  • ...and 11 more figures
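
The Figure 2 caption describes how the reference set is built differently in training and at inference. A minimal sketch of that bookkeeping follows, assuming images are identified by file path. The function names, the `images_by_identity` mapping, and the `n_refs` parameter are hypothetical and not taken from the paper, and the sketch covers only the selection and replication of references, not feature extraction or the MFF module.

```python
import random
from typing import Dict, List, Sequence


def sample_training_references(
    images_by_identity: Dict[str, Sequence[str]],
    identity: str,
    n_refs: int,
    rng: random.Random,
) -> List[str]:
    """Pick n_refs distinct images of one identity for a training step.

    Assumption: each identity has at least n_refs images available.
    """
    pool = list(images_by_identity[identity])
    if len(pool) < n_refs:
        raise ValueError(
            f"identity {identity!r} has {len(pool)} images, need {n_refs}"
        )
    # random.Random.sample draws without replacement, so images are distinct.
    return rng.sample(pool, n_refs)


def build_inference_references(image: str, n_refs: int) -> List[str]:
    """Replicate the single input image so the set again has n_refs entries."""
    if n_refs < 1:
        raise ValueError("n_refs must be at least 1")
    return [image] * n_refs


# Hypothetical usage with made-up file names.
dataset = {"id_0": ["a.png", "b.png", "c.png", "d.png", "e.png"]}
train_refs = sample_training_references(
    dataset, "id_0", n_refs=4, rng=random.Random(0)
)
infer_refs = build_inference_references("user.png", n_refs=4)
```

Keeping the set size fixed at $N$ in both phases means whatever consumes the references sees the same input shape during training and inference; only the amount of variation inside the set differs.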