ConsiStyle: Style Diversity in Training-Free Consistent T2I Generation

Yohai Mazuz, Janna Bruner, Lior Wolf

TL;DR

ConsiStyle addresses the challenge of maintaining consistent character identity across diverse styles in text-to-image generation without subject-specific training. It does so through a three-stage, training-free framework that stores style-relevant Values, computes cross-image correspondences, and performs selective Q/K transfer together with AdaIN-based attention crossing to prevent style leakage. Experimental results show improved prompt alignment and style fidelity with competitive subject consistency, corroborated by a user study that favors the proposed method for style and text alignment. The approach enables flexible, style-diverse storytelling and animation workflows by decoupling style from identity while preserving prompt fidelity.

Abstract

In text-to-image models, consistent character generation is the task of achieving text alignment while maintaining the subject's appearance across different prompts. However, since style and appearance are often entangled, existing methods struggle to preserve consistent subject characteristics while adhering to varying style prompts. Current approaches for consistent text-to-image generation typically rely on large-scale fine-tuning on curated image sets or per-subject optimization, which either fail to generalize across prompts or do not align well with textual descriptions. Meanwhile, training-free methods often fail to maintain subject consistency across different styles. In this work, we introduce a training-free method that achieves both style alignment and subject consistency. The attention matrices are manipulated such that Queries and Keys are obtained from the anchor image(s) that are used to define the subject, while the Values are imported from a parallel copy that is not subject-anchored. Additionally, cross-image components are added to the self-attention mechanism by expanding the Key and Value matrices. To do so without shifting from the target style, we align the statistics of the Value matrices. As demonstrated in a comprehensive battery of qualitative and quantitative experiments, our method effectively decouples style from subject appearance and enables faithful generation of text-aligned images with consistent characters across diverse styles.
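
The attention manipulation described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: it assumes a single attention head, no batching, and per-channel mean/std matching (AdaIN-style) as the way Value statistics are aligned. The function names `adain` and `crossed_attention`, the argument names, and the toy shapes are all hypothetical, and details such as which layers are modified or how cross-image correspondences are used are not covered here.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along `axis`."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def adain(content, style, eps=1e-5):
    """Rescale `content` so its per-channel mean/std match those of `style`.

    Both inputs have shape (tokens, channels); statistics are taken over tokens.
    """
    mu_c = content.mean(axis=0, keepdims=True)
    sd_c = content.std(axis=0, keepdims=True)
    mu_s = style.mean(axis=0, keepdims=True)
    sd_s = style.std(axis=0, keepdims=True)
    return (content - mu_c) / (sd_c + eps) * sd_s + mu_s


def crossed_attention(q_anchor, k_anchor, v_style, k_other, v_other):
    """Self-attention with anchored Q/K, style-pass V, and extended K/V.

    q_anchor, k_anchor: (n, d) Queries/Keys from the subject-anchored pass.
    v_style:            (n, d) Values from the parallel, non-anchored pass.
    k_other, v_other:   (m, d) Keys/Values from another image (assumed input).
    """
    # Align the imported Values to the target image's Value statistics
    # so that the other image's style does not leak in (assumed mechanism).
    v_other_aligned = adain(v_other, v_style)

    # Expand the Key and Value matrices with the cross-image components.
    k_ext = np.concatenate([k_anchor, k_other], axis=0)        # (n + m, d)
    v_ext = np.concatenate([v_style, v_other_aligned], axis=0)  # (n + m, d)

    scores = q_anchor @ k_ext.T / np.sqrt(q_anchor.shape[-1])   # (n, n + m)
    return softmax(scores, axis=-1) @ v_ext                     # (n, d)


# Toy usage with random tensors standing in for real attention features.
rng = np.random.default_rng(0)
n, m, d = 6, 5, 4
q_anchor = rng.normal(size=(n, d))
k_anchor = rng.normal(size=(n, d))
v_style = rng.normal(loc=2.0, scale=0.5, size=(n, d))
k_other = rng.normal(size=(m, d))
v_other = rng.normal(loc=-3.0, scale=2.0, size=(m, d))

out = crossed_attention(q_anchor, k_anchor, v_style, k_other, v_other)
print(out.shape)
```

In this sketch the attention map (where each token looks) is determined only by the anchored Queries and Keys, while the content that gets mixed comes from Values whose statistics follow the non-anchored, style-faithful pass. That separation is one way to read the decoupling of identity from style that the abstract describes.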

Paper Structure

This paper contains 15 sections, 5 equations, 13 figures, 4 tables.

Figures (13)

  • Figure 1: Consistent character generation across diverse styles. Our method preserves key characteristics such as patterns and colors while adhering to the style specified in each prompt. In contrast, SDXL aligns with the prompt and style but fails to maintain consistency across different prompts.
  • Figure 2: Overview of our method, illustrating the attention modification and crossing components.
  • Figure 3: Qualitative comparison of our method with ConsiStory, DB-LoRA, and IP-Adapter, demonstrating its effectiveness in terms of alignment with varying text descriptions, character consistency, and style alignment. Unlike other methods, our approach preserves character features and maintains consistent appearance while faithfully adhering to the specified style and textual descriptions.
  • Figure 4: Harmonization. Our method preserves the desired style, seamlessly integrating characters into stylized contexts such as cartoons or illustrations. It adapts both the appearance and the setting, e.g., casting firelit shadows on a dragon or applying a pinkish tone to a kitten in a similar environment.
  • Figure 5: Demonstration of the method's limitations. The first row illustrates inconsistencies in generating a complex object (spaceship), where high visual detail leads to variation across images. The second row highlights a failure to align with a distinct style—specifically, the Papercraft Collage style, evident in the face details.
  • ...and 8 more figures