Flux Already Knows -- Activating Subject-Driven Image Generation without Training
Hao Kang, Stathi Fotiadis, Liming Jiang, Qing Yan, Yumin Jia, Zichuan Liu, Min Jin Chong, Xin Lu
TL;DR
The paper tackles zero-shot subject-driven image generation without additional training or data, proposing mosaic-based conditioning in latent space with a vanilla Flux diffusion model. Its LatentUnfold framework encodes the subject image, tiles the resulting latents into an M×N mosaic, and inpaints one designated panel, aided by cascade attention and meta prompting to preserve identity while following the edit prompt. Key contributions include mosaic-driven subject preservation, the LatentUnfold inference pipeline, a cascade attention mechanism for cross-scale identity consistency, and a meta-prompting scheme that uses a multimodal LLM (MLLM) to generate more effective prompts. Experiments on DreamBooth data show identity and text-alignment metrics competitive with data- or training-dependent baselines, and applications such as logo insertion and virtual try-on demonstrate versatile edits, pointing to a lightweight, practical path for downstream customization with foundation models.
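The tiling and panel-masking step described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function name `build_mosaic`, the (C, H, W) latent layout, the default choice of the bottom-right panel, and zero-filling the masked panel are all assumptions made for illustration. A real pipeline would obtain the latent from the model's image encoder and hand the mosaic and mask to an inpainting sampler.

```python
import numpy as np


def build_mosaic(latent, rows, cols, target=None):
    """Tile a (C, H, W) latent into a rows x cols mosaic and mask one panel.

    Hypothetical helper for illustration only. Returns the mosaic of shape
    (C, rows*H, cols*W) and a boolean mask of shape (rows*H, cols*W) that is
    True over the panel to be inpainted.
    """
    _, h, w = latent.shape
    if target is None:
        # Assumption: inpaint the bottom-right panel by default.
        target = (rows - 1, cols - 1)
    r, k = target
    if not (0 <= r < rows and 0 <= k < cols):
        raise ValueError("target panel lies outside the mosaic grid")

    # Replicate the subject latent across the spatial axes only.
    mosaic = np.tile(latent, (1, rows, cols))

    # Mark the designated panel as the region to regenerate.
    mask = np.zeros((rows * h, cols * w), dtype=bool)
    mask[r * h:(r + 1) * h, k * w:(k + 1) * w] = True

    # Assumption: blank the target panel; a sampler would fill it with noise.
    mosaic[:, mask] = 0.0
    return mosaic, mask


# Usage: a 2x2 mosaic from a random 16-channel, 32x32 latent.
subject_latent = np.random.default_rng(0).normal(size=(16, 32, 32))
mosaic, mask = build_mosaic(subject_latent, rows=2, cols=2)
```

In this sketch the unmasked panels carry identical copies of the subject, so the only free region during completion is the designated panel.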
Abstract
We propose a simple yet effective zero-shot framework for subject-driven image generation using a vanilla Flux model. By framing the task as grid-based image completion and simply replicating the subject image(s) in a mosaic layout, we activate strong identity-preserving capabilities without any additional data, training, or inference-time fine-tuning. This "free lunch" approach is further strengthened by a novel cascade attention design and meta prompting technique, boosting fidelity and versatility. Experimental results show that our method outperforms baselines across multiple key metrics in benchmarks and human preference studies, with trade-offs in certain aspects. Additionally, it supports diverse edits, including logo insertion, virtual try-on, and subject replacement or insertion. These results demonstrate that a pre-trained foundational text-to-image model can enable high-quality, resource-efficient subject-driven generation, opening new possibilities for lightweight customization in downstream applications.
