CoDi: Subject-Consistent and Pose-Diverse Text-to-Image Generation

Zhanxin Gao, Beier Zhu, Liang Yao, Jian Yang, Ying Tai

TL;DR

CoDi tackles subject-consistent yet pose-diverse text-to-image generation with a training-free, two-stage diffusion strategy. Identity Transport (IT) uses optimal transport in the early denoising steps to transfer the reference subject's identity features into each target image while preserving that image's pose. Identity Refinement (IR) then uses selective cross-image attention in the later steps to refine identity details. The combination yields stronger subject consistency without sacrificing pose and layout diversity, as validated on the ConsiStory+ benchmark by quantitative metrics and a user study. The approach scales to multi-subject generation and adapts to different styles, which makes it a practical option for coherent visual storytelling and character design in diffusion-based T2I systems.

Abstract

Subject-consistent generation (SCG), which aims to maintain a consistent subject identity across diverse scenes, remains a challenge for text-to-image (T2I) models. Existing training-free SCG methods often achieve consistency at the cost of layout and pose diversity, hindering expressive visual storytelling. To address this limitation, we propose a subject-Consistent and pose-Diverse T2I framework, dubbed CoDi, that enables consistent subject generation with diverse poses and layouts. Motivated by the progressive nature of diffusion, where coarse structures emerge early and fine details are refined later, CoDi adopts a two-stage strategy: Identity Transport (IT) and Identity Refinement (IR). IT operates in the early denoising steps, using optimal transport to transfer identity features to each target image in a pose-aware manner. This promotes subject consistency while preserving pose diversity. IR is applied in the later denoising steps, selecting the most salient identity features to further refine subject details. Extensive qualitative and quantitative results on subject consistency, pose diversity, and prompt fidelity demonstrate that CoDi achieves both better visual perception and stronger performance across all metrics. The code is available at https://github.com/NJU-PCALab/CoDi.
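
The two-stage schedule described above can be pictured as a step-dependent switch between IT and IR. The following is a minimal sketch, not the authors' implementation: the function name `stage_for_step` and the parameter `transition_step` are hypothetical, and the actual transition point is a hyperparameter that the abstract does not specify.

```python
def stage_for_step(step: int, num_steps: int, transition_step: int) -> str:
    """Return which stage is active at a given denoising step.

    Steps are counted from 0 (noisiest) to num_steps - 1 (cleanest).
    `transition_step` is a hypothetical hyperparameter: the first step
    handled by Identity Refinement (IR) instead of Identity Transport (IT).
    """
    if not 0 <= step < num_steps:
        raise ValueError("step must satisfy 0 <= step < num_steps")
    if not 0 <= transition_step <= num_steps:
        raise ValueError("transition_step must lie in [0, num_steps]")
    # Early steps shape coarse structure, so identity is transported there;
    # later steps refine fine details.
    return "IT" if step < transition_step else "IR"


# Toy schedule: 10 denoising steps, switching from IT to IR at step 4.
schedule = [stage_for_step(s, num_steps=10, transition_step=4) for s in range(10)]
```

Moving `transition_step` earlier or later trades coarse, pose-aware transport against fine-detail refinement; the values used here are illustrative only.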

Paper Structure

This paper contains 24 sections, 21 equations, 17 figures, 4 tables.

Figures (17)

  • Figure 1: Comparison of subject-consistent generation methods: Vanilla SDXL [podell2023sdxl], ConsiStory [tewel2024training], StoryDiffusion [zhou2024storydiffusion], and CoDi (ours). (a, b) Existing methods sacrifice pose diversity for subject consistency; e.g., ConsiStory produces similar poses in panel (a), and in panel (b) see the lower-right images with hands placed in front. In contrast, CoDi generates consistent subjects while matching the pose diversity of Vanilla SDXL. (c) Subject consistency vs. pose diversity. Current methods struggle to balance the two, whereas CoDi achieves both effectively.
  • Figure 2: Overview of CoDi. (a) Extract subject masks ($M_\mathsf{id}$ and $M_n$) by averaging the image-text cross-attention at the final denoising timestep for subject-related tokens (e.g., "fairy"). (b) Compute the OT plan $T_n$ using the cost matrix $C$ and the probability masses $\mathbf{a}$ and $\mathbf{b}$ (detailed in Sec. \ref{section:stage1}). (c) Identity transport (IT) operates in the early denoising steps to transfer reference subject features to target images in a pose-aware manner. (d) Identity refinement (IR) operates in the late denoising steps to refine subject details using a selective cross-image attention mechanism.
  • Figure 3: Qualitative comparison among Vanilla SDXL [podell2023sdxl], ConsiStory [tewel2024training], StoryDiffusion [zhou2024storydiffusion], and 1Prompt1Story [liu2025one]. ConsiStory and StoryDiffusion generate similar poses across examples, while 1Prompt1Story preserves pose diversity but struggles with subject consistency. In contrast, our CoDi achieves both.
  • Figure 4: Main component analysis (qualitative) of identity transport (IT) and identity refinement (IR). IT enhances subject consistency at the coarse-grained level and preserves pose diversity. IR enhances subject consistency at the fine-grained level but reduces pose diversity. Their combination yields the best consistency while preserving diversity.
  • Figure 5: Ablation studies on (a) stage transition point, and (b) the effect of $\alpha$.
  • ...and 12 more figures
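
The OT plan $T_n$ in Figure 2 (b) is computed from a cost matrix $C$ and probability masses $\mathbf{a}$ and $\mathbf{b}$. As a rough illustration of what such a plan looks like, the sketch below uses entropic-regularised Sinkhorn iterations in pure Python. The solver choice, the function name `sinkhorn_plan`, the regularisation strength `reg`, and the toy inputs are all assumptions made for illustration; the captions above do not say which OT solver or cost CoDi actually uses.

```python
import math


def sinkhorn_plan(cost, a, b, reg=0.5, num_iters=200):
    """Approximate OT plan T whose row sums match a and column sums match b.

    cost: n x m nested list, cost[i][j] = cost of moving mass from
          reference token i to target token j (assumed non-negative).
    a, b: probability masses over reference / target tokens (each sums to 1).
    reg:  entropic regularisation strength (hypothetical default).
    """
    n, m = len(a), len(b)
    # Gibbs kernel: cheap pairs get large weights.
    K = [[math.exp(-cost[i][j] / reg) for j in range(m)] for i in range(n)]
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(num_iters):
        # Alternately rescale rows and columns to match the marginals.
        u = [a[i] / sum(K[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(K[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * K[i][j] * v[j] for j in range(m)] for i in range(n)]


# Toy example: 2 reference-subject tokens, 3 target-subject tokens.
cost = [[0.0, 0.5, 1.0],
        [1.0, 0.5, 0.0]]
a = [0.5, 0.5]           # mass on reference tokens
b = [0.25, 0.5, 0.25]    # mass on target tokens
plan = sinkhorn_plan(cost, a, b)
```

Each entry `plan[i][j]` indicates how much of reference token `i` is assigned to target token `j`; low-cost pairs receive most of the mass, which is the sense in which a plan of this kind can match identity features to target positions.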