The Chosen One: Consistent Characters in Text-to-Image Diffusion Models
Omri Avrahami, Amir Hertz, Yael Vinker, Moab Arar, Shlomi Fruchter, Ohad Fried, Daniel Cohen-Or, Dani Lischinski
TL;DR
This work tackles the challenge of generating consistent characters from a single text prompt with text-to-image diffusion models. It introduces an automated, iterative pipeline that clusters a gallery of generated images, extracts a cohesive identity via personalization, and refines the prompt-conditioned representation until convergence. The approach balances prompt alignment and identity consistency better than the baselines, as shown by quantitative metrics and a user study, and it enables practical applications in story visualization and editing. The authors note the method's computational cost and its limitations in handling nuanced supporting elements and spurious attributes, which suggests directions for improving efficiency and broadening applicability.
Abstract
Recent advances in text-to-image generation models have unlocked vast potential for visual creativity. However, users of these models struggle to generate consistent characters, a crucial aspect of numerous real-world applications such as story visualization, game development, asset design, and advertising. Current methods typically rely on multiple pre-existing images of the target character or involve labor-intensive manual processes. In this work, we propose a fully automated solution for consistent character generation, with the sole input being a text prompt. We introduce an iterative procedure that, at each stage, identifies a coherent set of images sharing a similar identity and extracts a more consistent identity from this set. Our quantitative analysis demonstrates that our method strikes a better balance between prompt alignment and identity consistency than the baseline methods, and these findings are reinforced by a user study. Finally, we showcase several practical applications of our approach.
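The iterative procedure described above can be illustrated with a toy sketch. This is not the authors' implementation. It assumes each generated image has already been reduced to an embedding vector, and it simulates both generation and personalization with synthetic Gaussian samples. All function names, the choice of k-means, and the stopping threshold are hypothetical choices made only for illustration.

```python
import numpy as np


def sample_gallery(rng, center, spread, n=64):
    """Hypothetical stand-in for 'generate n images and embed them'."""
    return center + spread * rng.standard_normal((n, center.shape[0]))


def kmeans(x, k, rng, iters=20):
    """Plain Lloyd's k-means (an assumed clustering choice); returns labels and centers."""
    centers = x[rng.choice(len(x), size=k, replace=False)].copy()
    labels = np.zeros(len(x), dtype=int)
    for _ in range(iters):
        dists = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        labels = dists.argmin(axis=1)
        for j in range(k):
            members = x[labels == j]
            if len(members) > 0:  # keep the old center if a cluster empties
                centers[j] = members.mean(axis=0)
    return labels, centers


def most_cohesive_cluster(x, labels, k, min_size=5):
    """Members of the cluster with the smallest mean squared distance to its own centroid."""
    best, best_score = None, np.inf
    for j in range(k):
        members = x[labels == j]
        if len(members) < min_size:  # ignore clusters too small to be meaningful
            continue
        score = ((members - members.mean(axis=0)) ** 2).sum(axis=1).mean()
        if score < best_score:
            best, best_score = members, score
    return best


def mean_pairwise_distance(x):
    """Average Euclidean distance over all ordered pairs of distinct rows."""
    diffs = x[:, None, :] - x[None, :, :]
    d = np.sqrt((diffs ** 2).sum(axis=-1))
    n = len(x)
    return d.sum() / (n * (n - 1))


def consistent_identity(seed=0, dim=8, k=4, threshold=1.0, max_iters=20):
    rng = np.random.default_rng(seed)
    center, spread = np.zeros(dim), 1.0
    history = []  # gallery spread per iteration, used as the convergence signal
    for _ in range(max_iters):
        gallery = sample_gallery(rng, center, spread)
        history.append(mean_pairwise_distance(gallery))
        if history[-1] < threshold:  # gallery is already consistent enough
            break
        labels, _ = kmeans(gallery, k, rng)
        cohesive = most_cohesive_cluster(gallery, labels, k)
        # Toy "personalization": recentre on the cohesive set and tighten the spread.
        center = cohesive.mean(axis=0)
        spread = float(cohesive.std(axis=0).mean())
    return center, history


center, history = consistent_identity()
print(len(history), history[0], history[-1])
```

With 64 samples and 4 clusters, at least one cluster always has 16 or more members, so the `min_size` filter never rejects every cluster. In a real system, the sampling step would be replaced by an actual text-to-image model plus a feature extractor, and the recentering step by a personalization method.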
