StoryMaker: Towards Holistic Consistent Characters in Text-to-image Generation

Zhengguang Zhou, Jing Li, Huaxia Li, Nemo Chen, Xu Tang

TL;DR

This paper introduces StoryMaker, a personalization solution that preserves not only facial consistency but also clothing, hairstyles, and body consistency, thus facilitating the creation of a story through a series of images.

Abstract

Tuning-free personalized image generation methods have achieved significant success in maintaining facial consistency, i.e., identities, even with multiple characters. However, the lack of holistic consistency in scenes with multiple characters hampers these methods' ability to create a cohesive narrative. In this paper, we introduce StoryMaker, a personalization solution that preserves not only facial consistency but also clothing, hairstyles, and body consistency, thus facilitating the creation of a story through a series of images. StoryMaker incorporates conditions based on face identities and cropped character images, which include clothing, hairstyles, and bodies. Specifically, we integrate the facial identity information with the cropped character images using the Positional-aware Perceiver Resampler (PPR) to obtain distinct character features. To prevent intermingling of multiple characters and the background, we separately constrain the cross-attention impact regions of different characters and the background using MSE loss with segmentation masks. Additionally, we train the generation network conditioned on poses to promote decoupling from poses. A LoRA is also employed to enhance fidelity and quality. Experiments underscore the effectiveness of our approach. StoryMaker supports numerous applications and is compatible with other community plug-ins. Our source code and model weights are available at https://github.com/RedAIGC/StoryMaker.
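
The attention constraint described in the abstract can be illustrated with a toy example. The following is a minimal sketch, not the authors' implementation: it assumes each character (and the background) has one cross-attention map and one binary segmentation mask of the same size, and it averages the squared error over all pixels and regions. The function name `masked_attention_mse` and the 2x2 toy data are hypothetical.

```python
from typing import List

Matrix = List[List[float]]


def masked_attention_mse(attn_maps: List[Matrix], masks: List[Matrix]) -> float:
    """Mean squared error between per-region attention maps and binary masks.

    attn_maps[k][i][j]: attention that pixel (i, j) assigns to region k's tokens
    masks[k][i][j]:     1.0 where pixel (i, j) belongs to region k, else 0.0
    (Both layouts are assumptions made for this illustration.)
    """
    if len(attn_maps) != len(masks):
        raise ValueError("need exactly one mask per attention map")
    total, count = 0.0, 0
    for attn, mask in zip(attn_maps, masks):
        for attn_row, mask_row in zip(attn, mask):
            for a, m in zip(attn_row, mask_row):
                total += (a - m) ** 2
                count += 1
    return total / count


# Toy 2x2 image: a character occupies the left column, background the right.
mask_char = [[1.0, 0.0], [1.0, 0.0]]
mask_bg = [[0.0, 1.0], [0.0, 1.0]]

# Attention that matches the masks exactly incurs no penalty.
aligned = masked_attention_mse([mask_char, mask_bg], [mask_char, mask_bg])

# Character attention that spreads evenly over the image is penalized.
uniform = [[0.5, 0.5], [0.5, 0.5]]
leaky = masked_attention_mse([uniform, mask_bg], [mask_char, mask_bg])
```

In this toy setup, attention that leaks outside a character's mask yields a larger loss than attention confined to it, which is the behavior the constraint is meant to encourage.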

Paper Structure (26 sections, 8 equations, 5 figures, 1 table)

Figures (5)

  • Figure 1: Visualization of images generated by our StoryMaker. The first three rows depict a story about a day in the life of an "office worker," while the last two rows tell a story inspired by the movie "Before Sunrise."
  • Figure 2: The model architecture of our proposed StoryMaker. The facial image and character image are embedded using the face encoder and image encoder, respectively, and refined through our proposed Positional-aware Perceiver Resampler module. Decoupled cross-attention with LoRAs is employed to inject these embeddings into the diffusion model. At the bottom, we illustrate the attention loss on cross-attention maps with the segmentation mask. The core of the PPR module is also depicted on the right.
  • Figure 3: Visual comparison on single character condition generation.
  • Figure 4: Visualization of two-character image generation. The first two columns display two different reference character images. The middle four columns illustrate StoryMaker's ability for realistic synthesis. The last four columns demonstrate results of stylized synthesis, where the character embedding is set to zero.
  • Figure 5: Diverse applications of StoryMaker.