Taming Identity Consistency and Prompt Diversity in Diffusion Models via Latent Concatenation and Masked Conditional Flow Matching
Aditi Singhania, Arushi Jain, Krutik Malani, Riddhi Dhawan, Souymodip Chakraborty, Vineet Batra, Ankit Phogat
TL;DR
The paper tackles the challenge of identity preservation in subject-driven diffusion under diverse prompts. It presents Instruct Identity, a latent-concatenation diffusion framework that forms $z_0$ by stacking $z_{\text{tgt}}$ and $z_{\text{ref}}$ and optimizes with a masked $\mathcal{L}_{\text{CFM}}^{\mu}$ to limit updates to the target region, along with a two-stage Distilled Data Curation Framework and parameter-efficient LoRA fine-tuning. A training-free CHARIS evaluation framework assesses identity consistency, prompt adherence, region-level color fidelity, visual quality, and transformation diversity, enabling fine-grained comparisons. Empirical results show stronger identity fidelity and generation quality across diverse contexts, with scalable data curation and minimal architectural changes, offering practical impact for large-scale, subject-centric generation.
Abstract
Subject-driven image generation aims to synthesize novel depictions of a specific subject across diverse contexts while preserving its core identity features. Achieving both strong identity consistency and high prompt diversity presents a fundamental trade-off. We propose a LoRA fine-tuned diffusion model employing a latent concatenation strategy, which jointly processes reference and target images, combined with a masked Conditional Flow Matching (CFM) objective. This approach enables robust identity preservation without architectural modifications. To facilitate large-scale training, we introduce a two-stage Distilled Data Curation Framework: the first stage leverages data restoration and VLM-based filtering to create a compact, high-quality seed dataset from diverse sources; the second stage utilizes these curated examples for parameter-efficient fine-tuning, thus scaling the generation capability across various subjects and contexts. Finally, for filtering and quality assessment, we present CHARIS, a fine-grained evaluation framework that performs attribute-level comparisons along five key axes: identity consistency, prompt adherence, region-wise color fidelity, visual quality, and transformation diversity.
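To make the latent concatenation and masked CFM objective concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes a rectified-flow style interpolation $z_t = (1 - t)\,z_0 + t\,\epsilon$ with velocity target $\epsilon - z_0$, concatenation along the width axis, and noise applied to the whole concatenated latent; the abstract does not specify any of these choices. The function names `concat_latents`, `target_mask`, and `masked_cfm_loss` are hypothetical, and a zero array stands in for the diffusion model's prediction.

```python
import numpy as np


def concat_latents(z_tgt, z_ref, axis=-1):
    """Stack target and reference latents along one spatial axis (assumed: width)."""
    return np.concatenate([z_tgt, z_ref], axis=axis)


def target_mask(z_tgt, z_ref, axis=-1):
    """Binary mask: 1 over the target region, 0 over the reference region."""
    return np.concatenate([np.ones_like(z_tgt), np.zeros_like(z_ref)], axis=axis)


def masked_cfm_loss(v_pred, z0, noise, mask):
    """Mean squared velocity error, counted only where mask == 1.

    Assumes the rectified-flow target v = noise - z0 (an assumption,
    not confirmed by the abstract).
    """
    v_target = noise - z0
    sq_err = mask * (v_pred - v_target) ** 2
    return sq_err.sum() / max(mask.sum(), 1.0)


rng = np.random.default_rng(0)
z_tgt = rng.normal(size=(4, 8, 8))   # toy target latent, shape (C, H, W)
z_ref = rng.normal(size=(4, 8, 8))   # toy reference latent, same shape

z0 = concat_latents(z_tgt, z_ref)    # shape (4, 8, 16)
mask = target_mask(z_tgt, z_ref)     # ones on the target half only

noise = rng.normal(size=z0.shape)
t = 0.3
z_t = (1 - t) * z0 + t * noise       # noisy input a model would receive

v_pred = np.zeros_like(z0)           # stand-in for the model's velocity output
loss = masked_cfm_loss(v_pred, z0, noise, mask)
print(z0.shape, float(loss))
```

Because the mask zeroes out the reference half, prediction errors there contribute nothing to the loss, so gradient updates would be driven only by the target region while the model still sees the reference latent as context.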
