Inv-Adapter: ID Customization Generation via Image Inversion and Lightweight Adapter

Peng Xing, Ning Wang, Jianbo Ouyang, Zechao Li

TL;DR

Inv-Adapter tackles ID customization by extracting diffusion-domain representations of a prompt image via DDIM inversion and injecting them into a pre-trained text-to-image model through a lightweight Embedded Attention Adapter (EAA). By dropping the extra image encoder and training only the 48M-parameter EAA, it achieves high identity fidelity and generation quality while remaining efficient. Quantitative and qualitative results on CelebA-HQ/FFHQ-derived benchmarks show strong face fidelity (FACE-SIM, CLIP-I, DINO) and solid generation loyalty at a smaller model scale than prior methods. The approach offers practical deployment benefits and shows promise for broader use in diffusion-based personalization, though limited data diversity and slow inversion remain avenues for future improvement.

Abstract

Remarkable advances in text-to-image generation models have significantly boosted research in ID customization generation. However, existing personalization methods cannot satisfy high-fidelity and high-efficiency requirements simultaneously. Their main bottleneck lies in the prompt image encoder, which produces signals that are weakly aligned with the text-to-image model and significantly increases model size. To this end, we propose a lightweight Inv-Adapter, which first extracts diffusion-domain representations of ID images with a pre-trained text-to-image model via DDIM image inversion, without an additional image encoder. Because the extracted ID prompt features are highly aligned with the intermediate features of the text-to-image model, we then embed them efficiently into the base text-to-image model through a carefully designed lightweight attention adapter. We conduct extensive experiments to assess ID fidelity, generation loyalty, speed, and training parameters, all of which show that the proposed Inv-Adapter is highly competitive in ID customization generation and model scale.
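
The DDIM inversion step mentioned in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names `ddim_invert` and `ddim_denoise`, the linear `alpha_bar` schedule, and the toy noise predictor `toy_eps` are all assumptions made for illustration, and the noise predictor stands in for the pre-trained text-to-image U-Net. The sketch only shows the deterministic DDIM update run forward (image latent to noisy latent) and backward (noisy latent to image latent).

```python
import numpy as np

def ddim_invert(x0, eps_fn, alpha_bar):
    """Deterministically map a clean latent x0 to a noisy latent x_T."""
    x = x0
    for t in range(len(alpha_bar) - 1):
        eps = eps_fn(x, t)  # predicted noise at the current step
        # Estimate of the clean latent implied by the current x and eps.
        x0_pred = (x - np.sqrt(1.0 - alpha_bar[t]) * eps) / np.sqrt(alpha_bar[t])
        # Re-noise that estimate to the next (noisier) level.
        x = np.sqrt(alpha_bar[t + 1]) * x0_pred + np.sqrt(1.0 - alpha_bar[t + 1]) * eps
    return x

def ddim_denoise(xT, eps_fn, alpha_bar):
    """Deterministic DDIM sampling (eta = 0) from x_T back toward x_0."""
    x = xT
    for t in range(len(alpha_bar) - 1, 0, -1):
        eps = eps_fn(x, t)
        x0_pred = (x - np.sqrt(1.0 - alpha_bar[t]) * eps) / np.sqrt(alpha_bar[t])
        x = np.sqrt(alpha_bar[t - 1]) * x0_pred + np.sqrt(1.0 - alpha_bar[t - 1]) * eps
    return x

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x0 = rng.normal(size=(4, 8))            # hypothetical image latent
    alpha_bar = np.linspace(1.0, 0.05, 51)  # assumed schedule; alpha_bar[0] = 1 is the clean level

    def toy_eps(x, t):
        # Hypothetical stand-in for the U-Net noise prediction.
        return 0.1 * np.tanh(x)

    xT = ddim_invert(x0, toy_eps, alpha_bar)
    recon = ddim_denoise(xT, toy_eps, alpha_bar)
    print("max reconstruction error:", np.abs(recon - x0).max())
```

With a real model the reconstruction is only approximate, because inversion evaluates the noise predictor at the current latent rather than at the next one. According to the overview in Figure 2, Inv-Adapter collects intermediate diffusion features while denoising the inverted latent; that feature extraction is not shown here.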

Paper Structure (12 sections, 6 equations, 10 figures, 4 tables)

Figures (10)

  • Figure 1: Comparison of a popular scheme (IP-Adapter [ye2023ip]) with the proposed Inv-Adapter. The attention maps show that Inv-Adapter captures detailed face features more accurately.
  • Figure 2: Overview of the proposed Inv-Adapter. First, in the diffusion feature extraction phase, the latent noise $Z_t^p$ is obtained by DDIM inversion of the prompt image $I_f$. Then, while denoising the latent noise $Z_t^p$, we extract diffusion features, which are intermediate representations of the diffusion model and contain detailed information about $Z_t^p$. The diffusion features are inserted into the same pre-trained text-to-image model through the Embedded Attention Adapter (EAA) to generate an image that preserves ID information. The final result, aligned with the prompts $P_c$ and $I_f$, is obtained after iterating for $T$ steps.
  • Figure 3: Visualization of the attention maps between the generated results and the prompt images. With diffusion features, the proposed Inv-Adapter attends only to the critical face region of the prompt image, which is the desired behavior.
  • Figure 4: Left: an image generated by the SD model [rombach2022high], with its self-attention and cross-attention maps at different steps. Right: generated results and attention visualizations from the Inv-Adapter ablation experiment on the attention layers.
  • Figure 5: Comparison of Inv-Adapter with the recent IP-Adapter [ye2023ip], IP-Adapter-Plus [ye2023ip], IP-Adapter-FaceID-Plus [ye2023ip], PhotoMaker [li2023photomaker], and InstantID [wang2024instantid] on Sample-1K.
  • ...and 5 more figures