The Chosen One: Consistent Characters in Text-to-Image Diffusion Models

Omri Avrahami, Amir Hertz, Yael Vinker, Moab Arar, Shlomi Fruchter, Ohad Fried, Daniel Cohen-Or, Dani Lischinski

TL;DR

This work tackles the challenge of generating consistent characters from a single text prompt with text-to-image diffusion models. It introduces an automated, iterative pipeline that clusters a gallery of generated images, extracts a cohesive identity from the chosen cluster via personalization, and refines the prompt-conditioned representation until convergence. The approach balances prompt alignment with identity consistency better than baselines, as evidenced by quantitative metrics and a user study, and enables practical applications in story visualization and editing. The authors note the method's computational cost and its limitations in handling nuanced supporting elements and spurious attributes, which suggest directions for improving efficiency and broadening applicability.

Abstract

Recent advances in text-to-image generation models have unlocked vast potential for visual creativity. However, users of these models struggle to generate consistent characters, a crucial capability for numerous real-world applications such as story visualization, game development, asset design, and advertising. Current methods typically rely on multiple pre-existing images of the target character or involve labor-intensive manual processes. In this work, we propose a fully automated solution for consistent character generation, with the sole input being a text prompt. We introduce an iterative procedure that, at each stage, identifies a coherent set of images sharing a similar identity and extracts a more consistent identity from this set. Our quantitative analysis demonstrates that our method strikes a better balance between prompt alignment and identity consistency than the baseline methods, and these findings are reinforced by a user study. To conclude, we showcase several practical applications of our approach.
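
The iterative procedure described in the abstract can be sketched as a simple control loop. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`refine_identity`, `generate`, `embed`, `select_cohesive`, `personalize`) are hypothetical, and the stopping rule (mean pairwise embedding distance below a threshold) is an assumed stand-in for "convergence", which this summary does not define. The toy stand-ins at the bottom use plain numbers in place of images so the loop can be run end to end.

```python
import math
from itertools import combinations
from statistics import mean


def mean_pairwise_distance(embeddings):
    """Average Euclidean distance over all pairs of embeddings."""
    return mean(math.dist(a, b) for a, b in combinations(embeddings, 2))


def refine_identity(representation, generate, embed, select_cohesive,
                    personalize, threshold, max_iters=10):
    """Repeat generate -> embed -> select -> personalize until images agree.

    All five callables are hypothetical placeholders for the model-specific
    steps. Returns the final representation and whether it converged.
    """
    for _ in range(max_iters):
        images = generate(representation)
        embeddings = [embed(image) for image in images]
        # Assumed convergence test: the generated images are already similar.
        if mean_pairwise_distance(embeddings) < threshold:
            return representation, True
        subset = select_cohesive(images, embeddings)
        representation = personalize(representation, subset)
    return representation, False


# Toy stand-ins: an "image" is a number, a representation is (center, spread).
def toy_generate(rep):
    center, spread = rep
    return [center + spread * offset for offset in (-1.0, -0.5, 0.0, 0.5, 1.0)]


def toy_embed(image):
    return [image]


def toy_select(images, embeddings):
    return sorted(images)[1:-1]  # drop the two most extreme samples


def toy_personalize(rep, subset):
    return (mean(subset), rep[1] / 2)  # recenter and tighten the spread


final_rep, converged = refine_identity(
    (0.0, 8.0), toy_generate, toy_embed, toy_select, toy_personalize,
    threshold=1.5,
)
```

In the toy run the spread halves on each pass until the samples fall within the threshold. In a real pipeline the same loop shape would wrap a diffusion model, a feature extractor, and a personalization method.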

Paper Structure (33 sections, 2 equations, 24 figures, 2 tables, 1 algorithm)

This paper contains 33 sections, 2 equations, 24 figures, 2 tables, 1 algorithm.

Figures (24)

  • Figure 1: Identity consistency. Given the prompt "a Plasticine of a cute baby cat with big eyes", a standard text-to-image diffusion model produces different cats (all corresponding to the input text), whereas our method produces the same cat.
  • Figure 2: Method overview. Given an input text prompt, we start by generating numerous images using the text-to-image model $M_{\Theta}$, which are embedded into a semantic feature space using the feature extractor $F$. Next, these embeddings are clustered and the most cohesive group is chosen, since it contains images with shared characteristics. The "common ground" among the images in this set is used to refine the representation $\Theta$ to better capture and fit the target. These steps are iterated until convergence to a consistent identity.
  • Figure 3: Embedding visualization. Given generated images for the text prompt "a sticker of a ginger cat", we project the set $S$ of their high-dimensional embeddings into 2D using t-SNE [Hinton2002StochasticNE] and indicate different K-MEANS++ [Arthur2007kmeansTA] clusters using different colors. Representative images are shown for three of the clusters. Images in each cluster share the same characteristics: the black cluster contains full-body cats, the red cluster contains cat heads, and the brown cluster contains images with multiple cats. According to our cohesion measure (Eq. \ref{eq:cohesion}), the black cluster is the most cohesive and is therefore chosen for identity extraction (or refinement).
  • Figure 4: Qualitative comparison. We compare our method against several baselines. TI [Gal2022AnII], BLIP-diffusion [Li2023BLIPDiffusionPS], and IP-adapter [Ye2023IPAdapterTC] follow the target prompts but do not preserve a consistent identity. LoRA DB [lora_diffusion] maintains consistency but does not always follow the prompt, and it generates the character in the same fixed pose. ELITE [Wei2023ELITEEV] struggles with prompt following and also tends to generate deformed characters. In contrast, our method follows the prompt and maintains consistent identities while generating the characters in different poses and from different viewing angles.
  • Figure 5: Quantitative Comparison and User Study. (Left) We compared our method quantitatively with various baselines in terms of identity consistency and prompt similarity, as explained in Sec. \ref{sec:comparisons}. LoRA DB and ELITE maintain high identity consistency while sacrificing prompt similarity. TI and BLIP-diffusion achieve high prompt similarity but low identity consistency. We also ablated some components of our method: removing the clustering stage, reducing the optimizable representation, re-initializing the representation in each iteration, and performing only a single iteration. All of the ablated cases resulted in a significant degradation of consistency. (Right) The user study rankings also show that our method strikes a balance between identity consistency and prompt similarity.
  • ...and 19 more figures
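
The cluster-selection step described in the Figure 2 and Figure 3 captions can be illustrated with a short sketch. The captions refer to a cohesion measure but this summary does not reproduce the equation, so the code below assumes cohesion is the mean squared Euclidean distance of a cluster's embeddings to its centroid, with lower values meaning a tighter cluster. The function names and the `min_size` filter are hypothetical, added here so that a singleton cluster (whose distance to its own centroid is trivially zero) cannot win. Cluster labels are taken as given, for example from a K-means++ run.

```python
from collections import defaultdict


def centroid(points):
    """Coordinate-wise mean of a list of equal-length vectors."""
    count = len(points)
    return [sum(coords) / count for coords in zip(*points)]


def cohesion(points):
    """Assumed measure: mean squared distance to the centroid (lower = tighter)."""
    center = centroid(points)
    total = sum(
        sum((x - c) ** 2 for x, c in zip(point, center)) for point in points
    )
    return total / len(points)


def most_cohesive_cluster(embeddings, labels, min_size=2):
    """Return (label, member indices) of the tightest sufficiently large cluster."""
    clusters = defaultdict(list)
    for index, label in enumerate(labels):
        clusters[label].append(index)
    # Hypothetical guard: ignore clusters too small to define an identity.
    candidates = {
        label: members
        for label, members in clusters.items()
        if len(members) >= min_size
    }
    if not candidates:
        raise ValueError("no cluster has at least min_size members")
    best = min(
        candidates,
        key=lambda label: cohesion([embeddings[i] for i in candidates[label]]),
    )
    return best, candidates[best]


# Toy 2D embeddings, with cluster names borrowed from the Figure 3 caption.
embeddings = [
    [0.0, 0.0], [0.1, 0.0], [0.0, 0.1],   # tight group
    [5.0, 5.0], [7.0, 5.0], [5.0, 8.0],   # loose group
    [9.0, 9.0],                            # singleton, filtered out
]
labels = ["black", "black", "black", "red", "red", "red", "brown"]
best_label, members = most_cohesive_cluster(embeddings, labels)
```

Here the tight group is selected and its member indices would identify the images passed on to identity extraction.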