ORACLE: Leveraging Mutual Information for Consistent Character Generation with LoRAs in Diffusion Models

Kiymet Akdemir, Pinar Yanardag

TL;DR

The paper tackles the problem of keeping a character's appearance consistent across contexts in text-to-image diffusion models. It introduces ORACLE, a three-stage pipeline that first generates a grid of candidate characters from a single prompt, then refines this set via mutual-information-based outlier filtering, and finally personalizes a LoRA model on the refined set to enable cross-context generation. Empirical results (qualitative comparisons, quantitative CLIP-based metrics, and a user study) show that ORACLE achieves a favorable balance between following prompts faithfully and preserving character identity, outperforming baselines such as The Chosen One, IP-Adapter, and LoRA-DB. The approach supports rapid, cohesive character design for comics, games, education, and related creative workflows by reducing manual curation and keeping a character visually consistent across scenes and media.
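The outlier-filtering stage can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes each grid segment is available as a flat list of grayscale pixel values in [0, 255], estimates mutual information from a joint histogram, and drops segments whose average pairwise score falls more than `z_threshold` standard deviations below the mean. The function names, the bin count, and the thresholding rule are all hypothetical choices made for illustration.

```python
import math
from collections import Counter


def mutual_information(a, b, bins=8, max_value=256):
    """Histogram estimate of mutual information (in nats) between two
    equal-length sequences of pixel intensities."""
    if len(a) != len(b) or not a:
        raise ValueError("inputs must be non-empty and of equal length")
    width = max_value / bins
    # Quantize intensities into `bins` buckets (assumed binning scheme).
    qa = [min(int(x / width), bins - 1) for x in a]
    qb = [min(int(y / width), bins - 1) for y in b]
    n = len(a)
    joint = Counter(zip(qa, qb))
    count_a = Counter(qa)
    count_b = Counter(qb)
    mi = 0.0
    for (i, j), c in joint.items():
        # p_ij * log(p_ij / (p_i * p_j)), written in terms of raw counts.
        mi += (c / n) * math.log(c * n / (count_a[i] * count_b[j]))
    return mi


def filter_outliers(images, z_threshold=1.0):
    """Return (kept_indices, average_scores).

    Each image is scored by its average mutual information with every
    other image. The z-score cutoff is an assumption of this sketch.
    """
    n = len(images)
    if n < 2:
        raise ValueError("need at least two images")
    averages = []
    for i in range(n):
        scores = [mutual_information(images[i], images[j])
                  for j in range(n) if j != i]
        averages.append(sum(scores) / len(scores))
    mean = sum(averages) / n
    std = math.sqrt(sum((s - mean) ** 2 for s in averages) / n)
    kept = [i for i, s in enumerate(averages)
            if std == 0 or s >= mean - z_threshold * std]
    return kept, averages
```

In a full pipeline, the kept indices would select the grid segments used to fine-tune the LoRA. A real system would likely compute the score on image crops or learned features rather than raw pixel lists.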

Abstract

Text-to-image diffusion models have recently taken center stage as pivotal tools in promoting visual creativity across an array of domains such as comic book artistry, children's literature, game development, and web design. These models harness the power of artificial intelligence to convert textual descriptions into vivid images, thereby enabling artists and creators to bring their imaginative concepts to life with unprecedented ease. However, one of the significant hurdles that persist is the challenge of maintaining consistency in character generation across diverse contexts. Variations in textual prompts, even if minor, can yield vastly different visual outputs, posing a considerable problem in projects that require a uniform representation of characters throughout. In this paper, we introduce a novel framework designed to produce consistent character representations from a single text prompt across diverse settings. Through both quantitative and qualitative analyses, we demonstrate that our framework outperforms existing methods in generating characters with consistent visual identities, underscoring its potential to transform creative industries. By addressing the critical challenge of character consistency, we not only enhance the practical utility of these models but also broaden the horizons for artistic and creative expression.

Paper Structure (20 sections, 7 equations, 10 figures)

Figures (10)

  • Figure 1: Given a text prompt such as 'a cute child with curly hair, cartoon style' (refer to the top row), our approach seamlessly produces consistent characters in a zero-shot manner by leveraging a pre-trained Stable Diffusion model. It ensures character consistency across a wide array of settings and backgrounds, demonstrating the versatility and practicality of our method. Our method has the potential to enhance the creative process in art and design, enabling more detailed storytelling and consistent character portrayal in animations, video games, and interactive media.
  • Figure 2: An overview of ORACLE. Our method operates through three phases: 1) It begins with the generation of a grid based on structured prompts that include character description, style, and a grid generator prompt, like "from different angles". 2) Subsequently, it calculates the average pairwise mutual information to identify potential outliers. 3) Once outliers are filtered out, a personalized model is trained using the refined grid segments.
  • Figure 3: Qualitative results. Our method can produce a wide array of characters in diverse contexts and styles, from imaginative figures like 'a bulldog wearing a jacket' and 'a pink owl', to photo-realistic characters such as 'a woman with a purple scarf'.
  • Figure 4: Quantitative comparisons. We use CLIP to assess the relevance of images to their prompts (image-prompt similarity) and identity consistency (image-image similarity).
  • Figure 5: User study results. The average user rating for each baseline is given for two types of questions (identity consistency and relevance to prompt). Rating is performed on a scale from 1 to 5.
  • ...and 5 more figures
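
The two CLIP-based scores named in the Figure 4 caption can be sketched as cosine similarities over precomputed embeddings. This is a rough illustration, not the paper's evaluation code: it assumes the CLIP image and text embeddings have already been extracted as plain lists of floats, and it assumes simple averaging (per image-prompt pair, and over all image pairs). The function names are hypothetical.

```python
import math
from itertools import combinations


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("zero-norm vector has no direction")
    return dot / (norm_u * norm_v)


def prompt_similarity(image_embeddings, prompt_embeddings):
    """Mean similarity between each image and the prompt that produced it
    (image-prompt similarity; the averaging scheme is an assumption)."""
    if len(image_embeddings) != len(prompt_embeddings) or not image_embeddings:
        raise ValueError("need one prompt embedding per image embedding")
    pairs = list(zip(image_embeddings, prompt_embeddings))
    return sum(cosine_similarity(i, p) for i, p in pairs) / len(pairs)


def identity_consistency(image_embeddings):
    """Mean similarity over all unordered image pairs
    (image-image similarity; the averaging scheme is an assumption)."""
    pairs = list(combinations(image_embeddings, 2))
    if not pairs:
        raise ValueError("need at least two image embeddings")
    return sum(cosine_similarity(a, b) for a, b in pairs) / len(pairs)
```

A method that scores high on both measures follows the prompt while keeping the character recognizable, which is the balance the quantitative comparison is meant to capture.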