Taming Identity Consistency and Prompt Diversity in Diffusion Models via Latent Concatenation and Masked Conditional Flow Matching

Aditi Singhania, Arushi Jain, Krutik Malani, Riddhi Dhawan, Souymodip Chakraborty, Vineet Batra, Ankit Phogat

TL;DR

The paper tackles the challenge of identity preservation in subject-driven diffusion under diverse prompts. It presents Instruct Identity, a latent-concatenation diffusion framework that forms $z_0$ by stacking $z_{\text{tgt}}$ and $z_{\text{ref}}$ and optimizes with a masked loss $\mathcal{L}_{\text{CFM}}^{\mu}$ to limit updates to the target region, along with a two-stage Distilled Data Curation Framework and parameter-efficient LoRA fine-tuning. A training-free CHARIS evaluation framework assesses identity consistency, prompt adherence, region-level color fidelity, visual quality, and transformation diversity, enabling fine-grained comparisons. Empirical results show stronger identity fidelity and generation quality across diverse contexts, with scalable data curation and minimal architectural changes, offering practical impact for large-scale, subject-centric generation.

Abstract

Subject-driven image generation aims to synthesize novel depictions of a specific subject across diverse contexts while preserving its core identity features. Achieving both strong identity consistency and high prompt diversity presents a fundamental trade-off. We propose a LoRA fine-tuned diffusion model employing a latent concatenation strategy, which jointly processes reference and target images, combined with a masked Conditional Flow Matching (CFM) objective. This approach enables robust identity preservation without architectural modifications. To facilitate large-scale training, we introduce a two-stage Distilled Data Curation Framework: the first stage leverages data restoration and VLM-based filtering to create a compact, high-quality seed dataset from diverse sources; the second stage utilizes these curated examples for parameter-efficient fine-tuning, thus scaling the generation capability across various subjects and contexts. Finally, for filtering and quality assessment, we present CHARIS, a fine-grained evaluation framework that performs attribute-level comparisons along five key axes: identity consistency, prompt adherence, region-wise color fidelity, visual quality, and transformation diversity.
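
To make the latent concatenation and masked CFM objective more concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes the target and reference latents are stacked along the width axis, that the mask $\mu$ is 1 on the target half and 0 on the reference half, and that the noisy latent follows the common rectified-flow interpolation $z_t = (1-t)\,z_0 + t\,\epsilon$ with velocity target $\epsilon - z_0$. The function names (`concat_latents`, `target_mask`, `masked_cfm_loss`) and the `predict_velocity` callable are hypothetical stand-ins for the fine-tuned model.

```python
import numpy as np

def concat_latents(z_tgt, z_ref):
    """Stack target and reference latents side by side (assumed: width axis).

    Both inputs have shape (C, H, W); the result has shape (C, H, 2W).
    """
    return np.concatenate([z_tgt, z_ref], axis=-1)

def target_mask(z_tgt, z_ref):
    """Binary mask mu: 1 over the target half, 0 over the reference half."""
    return np.concatenate([np.ones_like(z_tgt), np.zeros_like(z_ref)], axis=-1)

def masked_cfm_loss(predict_velocity, z0, mask, t, noise):
    """Masked flow-matching loss for one sample (illustrative convention).

    predict_velocity: hypothetical callable (z_t, t) -> array shaped like z0.
    """
    z_t = (1.0 - t) * z0 + t * noise       # assumed linear interpolation path
    v_target = noise - z0                  # velocity of that path
    v_pred = predict_velocity(z_t, t)
    sq_err = mask * (v_pred - v_target) ** 2
    # Average over target-region elements only, so errors on the
    # reference half contribute nothing to the loss.
    return sq_err.sum() / mask.sum()
```

Under these assumptions, a prediction that is wrong only on the reference half yields zero loss, which is the sense in which the mask limits updates to the target region. Whether the reference half is noised during training, and along which axis the latents are stacked, are details this sketch does not take from the paper.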

Paper Structure

This paper contains 29 sections, 6 equations, 5 figures, 2 tables, 2 algorithms.

Figures (5)

  • Figure 1: Overview of our reference-conditioned diffusion model architecture.
  • Figure 2: Pipeline for constructing a subject-consistent, high-quality dataset via regional prompting and visual-language model (VLM) filtering. A cartoon boy character is generated with Flux using global and regional prompts, followed by out-painting to extend pose diversity. VLM filtering ensures semantic consistency, and the resulting dataset is used to fine-tune the Flux model for improved subject identity retention across varied contexts.
  • Figure 3: Training masked loss vs full loss.
  • Figure 4: Full loss vs masked loss outputs for given prompts.
  • Figure 5: Visual comparison of identity preservation across diverse subjects. Our method maintains character-defining features and color fidelity compared to UNO [wu2025lesstomore], DSD [cai2025dsd], and OminiControl [tan2025ominicontrol].