CoCoIns: Consistent Subject Generation via Contrastive Instantiated Concepts
Lee Hsin-Ying, Kelvin C. K. Chan, Ming-Hsuan Yang
TL;DR
CoCoIns tackles subject inconsistency across images generated independently by diffusion models, a key obstacle to long-form content, without requiring fine-tuning or reference images. It introduces a lightweight mapping network that turns latent codes into pseudo-words inserted into prompts, binding each pseudo-word to a specific concept instance through a self-supervised contrastive loss. On single-subject face generation, the approach achieves subject consistency competitive with state-of-the-art tuning-free methods, and it shows promising extensions to multiple subjects and general concepts. The framework offers a flexible, scalable path toward controllable content creation in domains such as storytelling and comics.
Abstract
While text-to-image generative models can synthesize diverse and faithful content, subject variation across multiple generations limits their application to long-form content generation. Existing approaches require time-consuming fine-tuning, reference images for all subjects, or access to previously generated content. We introduce Contrastive Concept Instantiation (CoCoIns), a framework that effectively synthesizes consistent subjects across multiple independent generations. The framework consists of a generative model and a mapping network that transforms input latent codes into pseudo-words associated with specific concept instances. Users can generate consistent subjects by reusing the same latent codes. To construct such associations, we propose a contrastive learning approach that trains the network to distinguish between different combinations of prompts and latent codes. Extensive evaluations on human faces with a single subject show that CoCoIns performs comparably to existing methods while maintaining greater flexibility. We also demonstrate the potential for extending CoCoIns to multiple subjects and other object categories.
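
To make the mechanism described above concrete, the following is a minimal sketch in plain Python, not the authors' implementation. It assumes a toy fixed random projection as the mapping network, represents prompt tokens as plain lists of floats, and uses a generic triplet margin loss as a stand-in for the contrastive objective. All names (`make_mapping_network`, `insert_pseudo_word`, `triplet_contrastive_loss`) and all dimensions are hypothetical, and the loss is applied directly to the pseudo-word vectors here, whereas the features actually compared in CoCoIns are not specified in this summary.

```python
import math
import random


def make_mapping_network(latent_dim, embed_dim, seed=0):
    """Toy stand-in for the mapping network: a fixed random linear map + tanh.

    Assumption: the real network is learned; this one is only deterministic.
    """
    rng = random.Random(seed)
    scale = 1.0 / math.sqrt(latent_dim)
    weights = [[rng.gauss(0.0, scale) for _ in range(latent_dim)]
               for _ in range(embed_dim)]

    def mapping(latent_code):
        # Project the latent code to a pseudo-word embedding.
        return [math.tanh(sum(w * z for w, z in zip(row, latent_code)))
                for row in weights]

    return mapping


def insert_pseudo_word(prompt_embeddings, pseudo_word, position):
    """Return a new token sequence with the pseudo-word placed at `position`."""
    return prompt_embeddings[:position] + [pseudo_word] + prompt_embeddings[position:]


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def triplet_contrastive_loss(anchor, positive, negative, margin=1.0):
    """Generic triplet margin loss (an assumed stand-in for the paper's loss).

    Pulls the anchor toward the positive (same latent code) and pushes it
    away from the negative (different latent code) by at least `margin`.
    """
    gap = squared_distance(anchor, positive) - squared_distance(anchor, negative)
    return max(0.0, gap + margin)


# Usage: reusing a latent code reproduces the same pseudo-word embedding.
rng = random.Random(42)
mapping = make_mapping_network(latent_dim=8, embed_dim=4)
code_a = [rng.gauss(0.0, 1.0) for _ in range(8)]
code_b = [rng.gauss(0.0, 1.0) for _ in range(8)]

word_a_first = mapping(code_a)
word_a_again = mapping(code_a)   # same code, independent call
word_b = mapping(code_b)         # different code, different instance

prompt = [[0.0] * 4 for _ in range(3)]          # three placeholder token embeddings
conditioned = insert_pseudo_word(prompt, word_a_first, position=1)
loss = triplet_contrastive_loss(word_a_first, word_a_again, word_b)
```

In this sketch, consistency comes purely from determinism: the same latent code always yields the same pseudo-word, so two independent generations are conditioned on an identical extra token. The triplet loss illustrates only the general shape of a contrastive objective, in which pairs sharing a latent code are treated as positives and pairs with different codes as negatives.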
