Conceptrol: Concept Control of Zero-shot Personalized Image Generation

Qiyuan He, Angela Yao

TL;DR

Conceptrol addresses the challenge of zero-shot personalized image generation by integrating textual concepts into diffusion-based adapters. It identifies that treating a reference image as a global condition misaligns attention and harms adherence to prompts, and leverages textual concept masks extracted from concept-specific attention blocks to constrain visual specifications. The method is training-free and plug-and-play, applying an attention mask to guide where the personalized content should appear, and includes a warmup mechanism to stabilize early-generation guidance. Across UNet-based IP-Adapter, DiT-based OminiControl, and multiple base models, Conceptrol substantially improves concept preservation and prompt following, often surpassing fine-tuning methods like DreamBooth LoRA on benchmark tasks, with negligible overhead. This work demonstrates the critical role of integrating textual concepts into personalization pipelines to achieve robust, high-fidelity, and controllable image generation.
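The masking idea can be illustrated for an IP-Adapter-style decoupled cross-attention layer, where the output is the text attention plus a scaled image attention. The following is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the max-normalization of the mask, and the rule of dropping the image condition entirely during warmup are all hypothetical choices made for illustration.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over the last axis.
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def attend(q, k, v):
    # Scaled dot-product attention; returns (output, attention weights).
    weights = softmax(q @ k.T / np.sqrt(q.shape[-1]))
    return weights @ v, weights

def textual_concept_mask(text_weights, concept_tokens):
    # Assumption: the mask is the attention mass that each latent position
    # puts on the concept tokens, rescaled so its maximum is 1.
    mass = text_weights[:, concept_tokens].sum(axis=-1)
    return mass / (mass.max() + 1e-8)

def masked_adapter_output(q, text_kv, image_kv, concept_weights,
                          concept_tokens, step, warmup_steps=5, scale=1.0):
    # concept_weights: text attention weights taken from the
    # concept-specific block (passed in, not computed here).
    text_out, _ = attend(q, *text_kv)
    if step < warmup_steps:
        # Assumption: during warmup the image condition is switched off.
        return text_out
    image_out, _ = attend(q, *image_kv)
    mask = textual_concept_mask(concept_weights, concept_tokens)
    # The image condition only contributes where the concept is attended.
    return text_out + scale * mask[:, None] * image_out
```

With `scale=1.0` and a mask of all ones, this reduces to the unmasked decoupled cross-attention, which corresponds to treating the reference image as a global condition.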

Abstract

Personalized image generation with text-to-image diffusion models generates unseen images based on reference image content. Zero-shot adapter methods such as IP-Adapter and OminiControl are especially interesting because they do not require test-time fine-tuning. However, they struggle to balance preserving personalized content and adherence to the text prompt. We identify a critical design flaw resulting in this performance gap: current adapters inadequately integrate personalization images with the textual descriptions. The generated images, therefore, replicate the personalized content rather than adhere to the text prompt instructions. Yet the base text-to-image model has strong conceptual understanding capabilities that can be leveraged. We propose Conceptrol, a simple yet effective framework that enhances zero-shot adapters without adding computational overhead. Conceptrol constrains the attention of visual specification with a textual concept mask that improves subject-driven generation capabilities. It achieves as much as 89% improvement on personalization benchmarks over the vanilla IP-Adapter and can even outperform fine-tuning approaches such as DreamBooth LoRA. The source code is available at https://github.com/QY-H00/Conceptrol.

Paper Structure

This paper contains 18 sections, 11 equations, 17 figures, 3 tables.

Figures (17)

  • Figure 1: We propose Conceptrol, a training-free control method that markedly improves the customization capabilities of zero-shot adapters. As the first row shows, adapters exhibit issues such as copy-paste artifacts (e.g., duplicating the book) and mismatched visual specifications (e.g., displaying a red book or an inconsistent statue). In contrast, Conceptrol accurately preserves the identity while strictly following the text prompt, and it can be applied to multiple personalized generations, following the references for the statue and the yellow book simultaneously. Our method supports different base models (Stable Diffusion, SDXL and FLUX), personalized targets (e.g., objects and styles), and model parameters (e.g., SDXL and Juggernaut XL), all without computational overhead, training data or auxiliary models.
  • Figure 2: Overview of Conceptrol. Conceptrol extracts a textual concept mask indicating the region of interest for a textual concept (e.g., "statue"), from a concept-specific block (e.g., UP BLOCK 0.1.3 in SDXL). It then adjusts the attention to the corresponding visual specification (i.e., a personalized image of the statue) in the adapters accordingly to enhance personalization.
  • Figure 3: Treating image conditions globally can be problematic. The first row shows IP-Adapter on Stable Diffusion 1.5 with varying IP Scales, where increasing the scale shifts the output from "a statue reading a book" to "a book statue." The second row shows OminiControl on FLUX failing to preserve the book's yellow color, generating red books instead at different conditioning scales.
  • Figure 4: Incorrect attention map of image conditions. This example illustrates IP-Adapter results with fully text-based input and with an additional image condition added at the 10th of 50 total denoising steps. The blue box shows the attention map of 'avocado,' while the red box highlights the attention map of the image condition, which, given the avocado image, incorrectly attends to the dog area as well, distorting results and reducing text prompt adherence.
  • Figure 5: Not all attention maps of a textual concept strongly indicate the region of interest. Shown are examples from the FLUX model including (a) generated results for "dolphin", (b) segmentation results from SAM indicating the subject, and attention maps for "dolphin" from (c) BLOCK 18 and (d) BLOCK 11.
  • ...and 12 more figures
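
The comparison in Figure 5 between attention maps and a segmentation mask can be made concrete with a simple score. The sketch below is one hypothetical way to quantify how strongly a block's attention map indicates the subject: the fraction of attention mass that falls inside the segmentation mask. The metric, the function names, and the block labels are assumptions for illustration; this section does not state the criterion the authors use to choose a concept-specific block.

```python
import numpy as np

def attention_mass_inside(attn_map, subject_mask):
    # Fraction of total attention that lands inside the subject mask.
    attn = np.asarray(attn_map, dtype=float)
    mask = np.asarray(subject_mask, dtype=bool)
    total = attn.sum()
    return float(attn[mask].sum() / total) if total > 0 else 0.0

def pick_block(attn_maps_by_block, subject_mask):
    # Score every block and return the best-scoring name with all scores.
    scores = {name: attention_mass_inside(attn, subject_mask)
              for name, attn in attn_maps_by_block.items()}
    return max(scores, key=scores.get), scores

# Toy 4x4 example: the subject occupies the top-left 2x2 patch.
subject = np.zeros((4, 4), dtype=bool)
subject[:2, :2] = True
maps = {
    "block_a": subject.astype(float),  # attention concentrated on the subject
    "block_b": np.ones((4, 4)),        # attention spread uniformly
}
best, scores = pick_block(maps, subject)
```

In this toy case the concentrated map scores higher than the uniform one, mirroring the qualitative difference the figure describes between blocks.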