Conceptrol: Concept Control of Zero-shot Personalized Image Generation
Qiyuan He, Angela Yao
TL;DR
Conceptrol addresses zero-shot personalized image generation by integrating textual concepts into diffusion-based adapters. It identifies that treating a reference image as a global condition misaligns attention and harms prompt adherence, and it uses textual concept masks, extracted from concept-specific attention blocks, to constrain where the visual specification is applied. The method is training-free and plug-and-play: it applies an attention mask to guide where the personalized content should appear and includes a warmup mechanism to stabilize guidance in the early generation steps. Across the UNet-based IP-Adapter, the DiT-based OminiControl, and multiple base models, Conceptrol substantially improves concept preservation and prompt following, often surpassing fine-tuning methods such as DreamBooth LoRA on benchmark tasks, with negligible overhead. This work demonstrates the critical role of integrating textual concepts into personalization pipelines to achieve robust, high-fidelity, and controllable image generation.
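The masking and warmup ideas described above can be illustrated with a toy NumPy sketch. It is not the authors' implementation. It assumes an IP-Adapter-style decoupled cross-attention layer in which the image branch is added to the text branch, it assumes the concept mask is the attention that each latent query pays to the concept token(s), and it assumes warmup simply withholds the image branch for the first few steps. All function names, parameters, and shapes below are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def textual_concept_mask(q, k_text, concept_idx):
    """Attention of each latent query to the concept token(s), scaled to [0, 1].

    q: (n_latent, d) latent queries; k_text: (n_text, d) text keys;
    concept_idx: list of token positions naming the concept (assumed known).
    """
    d = q.shape[-1]
    attn = softmax(q @ k_text.T / np.sqrt(d))      # (n_latent, n_text)
    m = attn[:, concept_idx].sum(axis=-1)          # (n_latent,)
    return m / (m.max() + 1e-8)

def masked_adapter_output(q, k_text, v_text, k_img, v_img,
                          concept_idx, step, warmup_steps, scale=1.0):
    """Text attention plus an image branch gated by the concept mask."""
    d = q.shape[-1]
    text_out = softmax(q @ k_text.T / np.sqrt(d)) @ v_text
    if step < warmup_steps:
        # Assumption: during warmup the image condition is withheld,
        # since the concept mask is not yet reliable.
        return text_out
    img_out = softmax(q @ k_img.T / np.sqrt(d)) @ v_img
    mask = textual_concept_mask(q, k_text, concept_idx)
    # The image branch only contributes where the concept is attended.
    return text_out + scale * mask[:, None] * img_out
```

In this sketch, latent positions that pay little attention to the concept token receive almost none of the reference-image signal, which is one way to keep the reference from acting as a global condition.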
Abstract
Personalized image generation with text-to-image diffusion models generates unseen images based on reference image content. Zero-shot adapter methods such as IP-Adapter and OminiControl are especially interesting because they do not require test-time fine-tuning. However, they struggle to balance preservation of the personalized content with adherence to the text prompt. We identify a critical design flaw behind this performance gap: current adapters inadequately integrate personalization images with the textual descriptions. The generated images therefore replicate the personalized content rather than follow the text prompt instructions. Yet the base text-to-image model has strong conceptual understanding capabilities that can be leveraged. We propose Conceptrol, a simple yet effective framework that enhances zero-shot adapters without adding computational overhead. Conceptrol constrains the attention of the visual specification with a textual concept mask, which improves subject-driven generation capabilities. It achieves as much as 89% improvement on personalization benchmarks over the vanilla IP-Adapter and can even outperform fine-tuning approaches such as DreamBooth LoRA. The source code is available at https://github.com/QY-H00/Conceptrol.
