Flux Already Knows -- Activating Subject-Driven Image Generation without Training

Hao Kang, Stathi Fotiadis, Liming Jiang, Qing Yan, Yumin Jia, Zichuan Liu, Min Jin Chong, Xin Lu

TL;DR

The paper tackles zero-shot subject-driven image generation without training or additional data, proposing mosaic-based conditioning in latent space with a vanilla Flux diffusion model. Its LatentUnfold framework encodes a subject image, tiles the resulting latents into an M×N mosaic, and inpaints a designated panel, aided by cascade attention and meta prompting to preserve identity and follow the edit prompt. Key contributions include mosaic-driven subject preservation, the LatentUnfold inference pipeline, a cascade attention mechanism for cross-scale identity consistency, and a meta-prompting scheme that uses an MLLM to generate more effective prompts. Experiments on DreamBooth data show identity and text alignment metrics competitive with baselines that depend on extra data or training, and applications demonstrate versatile edits such as logo insertion and virtual try-on, highlighting a lightweight, practical path for downstream customization with foundation models.

Abstract

We propose a simple yet effective zero-shot framework for subject-driven image generation using a vanilla Flux model. By framing the task as grid-based image completion and simply replicating the subject image(s) in a mosaic layout, we activate strong identity-preserving capabilities without any additional data, training, or inference-time fine-tuning. This "free lunch" approach is further strengthened by a novel cascade attention design and meta prompting technique, boosting fidelity and versatility. Experimental results show that our method outperforms baselines across multiple key metrics in benchmarks and human preference studies, with trade-offs in certain aspects. Additionally, it supports diverse edits, including logo insertion, virtual try-on, and subject replacement or insertion. These results demonstrate that a pre-trained foundational text-to-image model can enable high-quality, resource-efficient subject-driven generation, opening new possibilities for lightweight customization in downstream applications.
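
The mosaic layout and panel-wise completion described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names `build_mosaic` and `blend_step`, the `(C, H, W)` latent layout, and the mask-blending rule (a common latent inpainting step) are all assumptions made for illustration, and the sketch omits the noising of the reference latents that a real diffusion sampler would apply at each step.

```python
import numpy as np


def build_mosaic(latent, rows, cols, target):
    """Tile a (C, H, W) latent into a rows x cols mosaic and mask one panel.

    Hypothetical helper, not taken from the paper. Returns (mosaic, mask),
    where mask is 1 inside the target panel (the region to regenerate)
    and 0 elsewhere (the replicated reference panels).
    """
    _, h, w = latent.shape
    r, q = target
    if not (0 <= r < rows and 0 <= q < cols):
        raise ValueError("target panel lies outside the mosaic")
    # Repeat the latent along height and width; channels are left untouched.
    mosaic = np.tile(latent, (1, rows, cols))
    mask = np.zeros((1, rows * h, cols * w), dtype=latent.dtype)
    mask[:, r * h:(r + 1) * h, q * w:(q + 1) * w] = 1
    return mosaic, mask


def blend_step(denoised, reference, mask):
    """Keep reference content outside the target panel (assumed blending rule)."""
    return mask * denoised + (1 - mask) * reference


# Example: a 2x2 mosaic whose bottom-right panel is the one to be completed.
latent = np.random.default_rng(0).normal(size=(16, 8, 8)).astype(np.float32)
mosaic, mask = build_mosaic(latent, rows=2, cols=2, target=(1, 1))
candidate = np.zeros_like(mosaic)  # stand-in for one denoiser output
blended = blend_step(candidate, mosaic, mask)
```

In this sketch only the masked panel is ever overwritten, so the other panels keep serving as identity references while the model completes the grid.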

Paper Structure

This paper contains 26 sections, 13 equations, 12 figures, 4 tables, 1 algorithm.

Figures (12)

  • Figure 1: We introduce a streamlined framework for subject-driven generation using a vanilla Flux model, with no training, no inference-time tuning, and no additional data. By leveraging mosaic-formatted image conditions and framing the task as mosaic image completion, it supports both single and multiple views, preserving subject identity and adhering to edit prompts, yielding diverse, high-fidelity results.
  • Figure 2: The top row illustrates the LatentUnfold pipeline, while the bottom row showcases the novel Cascade Attention mechanism.
  • Figure 3: Attention visualizations with and without Cascade Attention, with the reference image displayed on the right. With Cascade Attention the attention is enhanced, which better preserves details on the toy's teeth and belly.
  • Figure 4: Qualitative results of subject-driven tasks on novel objects. The top row displays reference images, while the corresponding text prompts are listed below each generated image.
  • Figure 5: Our method demonstrates robust performance in the virtual try-on task, where both the human model and the scene are generated through the control of text prompts. The reference garments are displayed as small images in the bottom-left corner of each image, and the corresponding text prompts are listed below each generated image.
  • ...and 7 more figures