CoCoIns: Consistent Subject Generation via Contrastive Instantiated Concepts

Lee Hsin-Ying, Kelvin C. K. Chan, Ming-Hsuan Yang

TL;DR

CoCoIns addresses subject inconsistency across long-form content generated by diffusion models, and it does so without fine-tuning or reference images. It introduces a lightweight mapping network that turns latent codes into pseudo-words inserted into prompts, and it binds those pseudo-words to specific concept instances through a self-supervised contrastive loss. On single-subject faces, the approach achieves subject consistency comparable to state-of-the-art tuning-free methods, and it shows promising extensions to multiple subjects and general concepts. The framework offers a flexible, scalable path toward controllable content creation in domains such as storytelling and comics.

Abstract

While text-to-image generative models can synthesize diverse and faithful content, subject variation across multiple generations limits their application to long-form content generation. Existing approaches require time-consuming fine-tuning, reference images for all subjects, or access to previously generated content. We introduce Contrastive Concept Instantiation (CoCoIns), a framework that effectively synthesizes consistent subjects across multiple independent generations. The framework consists of a generative model and a mapping network that transforms input latent codes into pseudo-words associated with specific concept instances. Users can generate consistent subjects by reusing the same latent codes. To construct such associations, we propose a contrastive learning approach that trains the network to distinguish between different combinations of prompts and latent codes. Extensive evaluations on human faces with a single subject show that CoCoIns performs comparably to existing methods while maintaining greater flexibility. We also demonstrate the potential for extending CoCoIns to multiple subjects and other object categories.
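
The reuse of latent codes described in the abstract can be illustrated with a small sketch. It is not the authors' implementation: the mapping network here is a hypothetical fixed random linear map (the real one is learned), and the names `make_mapping_network`, `insert_pseudo_word`, and the `[w]` placeholder token are assumptions made for illustration. It only shows that the same latent code yields the same pseudo-word, which is placed before the concept token.

```python
import random


def make_mapping_network(latent_dim, embed_dim, seed=0):
    """Hypothetical stand-in for the mapping network: a fixed random linear map."""
    rng = random.Random(seed)
    weights = [[rng.gauss(0.0, 1.0) for _ in range(latent_dim)]
               for _ in range(embed_dim)]

    def mapping(latent_code):
        # One output value per weight row (a plain matrix-vector product).
        return [sum(w * z for w, z in zip(row, latent_code)) for row in weights]

    return mapping


def insert_pseudo_word(tokens, embeddings, concept_token, pseudo_word):
    """Insert the pseudo-word embedding just before the first concept token."""
    position = tokens.index(concept_token)
    new_tokens = tokens[:position] + ["[w]"] + tokens[position:]
    new_embeddings = embeddings[:position] + [pseudo_word] + embeddings[position:]
    return new_tokens, new_embeddings


mapping = make_mapping_network(latent_dim=4, embed_dim=3)
latent_code = [0.1, -0.4, 0.7, 0.2]  # reused across independent generations

tokens = ["a", "man", "reading"]
embeddings = [[0.0, 0.0, 0.0] for _ in tokens]  # placeholder token embeddings

# Two independent "generations" that reuse the same latent code.
first_tokens, first_embeddings = insert_pseudo_word(
    tokens, embeddings, "man", mapping(latent_code))
second_tokens, second_embeddings = insert_pseudo_word(
    tokens, embeddings, "man", mapping(latent_code))
```

Because the mapping is deterministic, both calls produce an identical pseudo-word embedding, which is the property that lets a user request the same subject instance again.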

Paper Structure

This paper contains 22 sections, 9 equations, 13 figures, 12 tables.

Figures (13)

  • Figure 1: Contrastive Concept Instantiation (CoCoIns) is a generation framework that achieves subject consistency across multiple independent generations without fine-tuning or reference images. Unlike prior work that requires customization fine-tuning, adopts an additional encoder for references, or generates images in batches, CoCoIns creates instances of concepts with unique associations connecting latent codes to subject instances. Given a latent code (o/x), CoCoIns converts it into a pseudo-word ([o]/[x]) that determines the appearance of a subject concept. By reusing the same code, users can consistently generate the same subject instances across generations.
  • Figure 2: Overview of Contrastive Concept Instantiation (CoCoIns). We develop a contrastive learning approach to build associations between input latent codes and concept instances. For each training image, we generate two descriptions and randomly sample two latent codes $\bm{z}_1$ and $\bm{z}_2$. The mapping network transforms the latent codes into pseudo-words $\bm{w}_1$ and $\bm{w}_2$. We then construct a triplet of combinations of descriptions and latent codes. We build (a) an anchor sample with description embedding $\bm{e}^*$ modulated by inserting $\bm{w}_1$ before the target concept token, (b) a positive sample $\bm{e}^+$ with a similar description modulated with $\bm{w}_1$, and (c) a negative sample $\bm{e}^-$ with the same prompt as the anchor but modulated with a different pseudo-word $\bm{w}_2$. The network is trained with a triplet loss to differentiate approximated images $\hat{\bm{x}}^*$, $\hat{\bm{x}}^+$, and $\hat{\bm{x}}^-$ from denoiser predictions $\hat{\epsilon}^*$, $\hat{\epsilon}^+$, and $\hat{\epsilon}^-$.
  • Figure 3: Qualitative comparisons on Portraits from (a) StoryDiffusion zhou2024storydiffusion, (b) Consistory tewel2024consistory, (c) 1Prompt1Story liu2025onepromptonestory, (d) DreamBooth ruiz2022dreambooth, and (e) CoCoIns. The left four columns are generated with a man as the subject, and the right four with a woman. We achieve subject consistency without generating images in batches or performing reference fine-tuning.
  • Figure 4: Qualitative comparisons on Scenes from (a) StoryDiffusion zhou2024storydiffusion, (b) Consistory tewel2024consistory, (c) DreamBooth ruiz2022dreambooth, and (d) CoCoIns. 1Prompt1Story liu2025onepromptonestory is absent because it requires a specific prompt format with unified subject descriptions. The left and right four columns show two different subjects in diverse contexts. The prompts can be found in \ref{sec:data-details}.
  • Figure 5: Performance of prompt and noise combinations for constructing training triplets. Compared to using the same prompts ($=$) or noise for the anchor and positive samples, using two different ($\neq$) prompts and noise yields the best face similarity and diversity with similar prompt fidelity.
  • ...and 8 more figures
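
The triplet construction in the Figure 2 caption can be sketched as follows. The caption says only that a triplet loss is used, so the margin-based form with Euclidean distance below is an assumption, as are the names `build_triplet`, `l2_distance`, `triplet_loss`, and the `margin` value. The toy 2-D vectors stand in for the approximated images $\hat{\bm{x}}^*$, $\hat{\bm{x}}^+$, and $\hat{\bm{x}}^-$.

```python
import math


def build_triplet(description_a, description_b, code_1, code_2):
    """Pair descriptions with latent codes as in the Figure 2 caption."""
    return {
        "anchor": (description_a, code_1),    # description + w_1
        "positive": (description_b, code_1),  # similar description, same code
        "negative": (description_a, code_2),  # same prompt, different code
    }


def l2_distance(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Standard margin-based triplet loss (assumed form, not from the paper)."""
    return max(0.0, l2_distance(anchor, positive)
               - l2_distance(anchor, negative) + margin)


triplet = build_triplet("a man reading", "a man with a book", "z1", "z2")

# Toy stand-ins for approximated images of the three samples.
x_anchor = [0.0, 0.0]
x_positive = [0.0, 1.0]   # distance 1 from the anchor
x_negative = [3.0, 4.0]   # distance 5 from the anchor

loss_separated = triplet_loss(x_anchor, x_positive, x_negative)  # 0.0
loss_violating = triplet_loss(x_anchor, x_negative, x_positive)  # 5.0
```

The loss is zero once the negative is farther from the anchor than the positive by at least the margin, and it grows when a sample with a different latent code looks closer than one sharing the anchor's code.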