DreamStory: Open-Domain Story Visualization by LLM-Guided Multi-Subject Consistent Diffusion

Huiguo He, Huan Yang, Zixi Tuo, Yuan Zhou, Qiuyue Wang, Yuhang Zhang, Zeyu Liu, Wenhao Huang, Hongyang Chao, Jian Yin

TL;DR

DreamStory addresses open-domain story visualization by combining an LLM acting as a story director with a training-free Multi-Subject consistent Diffusion model (MSD), which enforces subject-level consistency across frames through Masked Mutual Self-Attention (MMSA) and Masked Mutual Cross-Attention (MMCA). The LLM generates scene- and subject-level prompts, subject portraits generated from those prompts serve as multimodal anchors, and open-vocabulary segmentation supplies the subject masks. The DS-500 benchmark shows improvements in aesthetics, image-text alignment, and multi-subject consistency over state-of-the-art methods. Because MSD is training-free, it outperforms the baselines without any finetuning, and its performance is robust across different LLMs and segmentation models. The framework enables scalable, open-domain visualization of narratives with multiple characters.

Abstract

Story visualization aims to create visually compelling images or videos corresponding to textual narratives. Despite recent advances in diffusion models yielding promising results, existing methods still struggle to create a coherent sequence of subject-consistent frames based solely on a story. To this end, we propose DreamStory, an automatic open-domain story visualization framework that leverages LLMs and a novel multi-subject consistent diffusion model. DreamStory consists of (1) an LLM acting as a story director and (2) an innovative Multi-Subject consistent Diffusion model (MSD) for generating multiple consistent subjects across images. First, DreamStory employs the LLM to generate descriptive prompts for subjects and scenes aligned with the story, annotating each scene's subjects for subsequent subject-consistent generation. Second, DreamStory utilizes these detailed subject descriptions to create portraits of the subjects, with these portraits and their corresponding textual information serving as multimodal anchors (guidance). Finally, the MSD uses these multimodal anchors to generate story scenes with multiple consistent subjects. Specifically, the MSD includes Masked Mutual Self-Attention (MMSA) and Masked Mutual Cross-Attention (MMCA) modules. MMSA ensures appearance consistency with the reference images, and MMCA ensures semantic consistency with the reference text. Both modules employ masking mechanisms to prevent subject blending. To validate our approach and promote progress in story visualization, we established a benchmark, DS-500, which assesses the overall performance of the story visualization framework, subject-identification accuracy, and the consistency of the generation model. Extensive experiments validate the effectiveness of DreamStory in both subjective and objective evaluations. Please visit our project homepage at https://dream-xyz.github.io/dreamstory.
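
The abstract describes the masking idea only at a high level, so the following is a minimal NumPy sketch of how a masked mutual self-attention step might look, not the authors' implementation. It assumes that scene tokens attend to their own keys plus the keys of each subject's portrait, and that a scene token may read from a portrait only when it lies inside that subject's mask and the portrait token is foreground. The function name, argument names, and tensor shapes are all hypothetical.

```python
import numpy as np

def masked_mutual_self_attention(q, k_scene, v_scene, k_refs, v_refs,
                                 scene_masks, ref_masks):
    """Illustrative sketch only (assumed formulation, not from the paper).

    q, k_scene, v_scene: (N, d) arrays for the N scene tokens.
    k_refs, v_refs:      lists of (M, d) arrays, one pair per subject portrait.
    scene_masks:         (S, N) bool, True where scene token i belongs to subject s.
    ref_masks:           list of (M,) bool, True for foreground portrait tokens.
    """
    n, d = q.shape
    keys, values = [k_scene], [v_scene]
    # Scene tokens may always attend to every token of the scene itself.
    allowed = [np.ones((n, n), dtype=bool)]
    for s, (k, v, fg) in enumerate(zip(k_refs, v_refs, ref_masks)):
        keys.append(k)
        values.append(v)
        # Token i may read portrait token j of subject s only if i is inside
        # subject s's region and j is a foreground token of that portrait.
        allowed.append(scene_masks[s][:, None] & fg[None, :])
    K = np.concatenate(keys, axis=0)
    V = np.concatenate(values, axis=0)
    A = np.concatenate(allowed, axis=1)

    logits = (q @ K.T) / np.sqrt(d)
    logits = np.where(A, logits, -np.inf)          # blocked pairs get zero weight
    logits = logits - logits.max(axis=1, keepdims=True)
    weights = np.exp(logits)
    weights = weights / weights.sum(axis=1, keepdims=True)
    return weights @ V, weights

# Toy example: 4 scene tokens, 2 subjects, 3 tokens per portrait.
rng = np.random.default_rng(0)
N, M, d = 4, 3, 8
q, k_scene, v_scene = (rng.normal(size=(N, d)) for _ in range(3))
k_refs = [rng.normal(size=(M, d)) for _ in range(2)]
v_refs = [rng.normal(size=(M, d)) for _ in range(2)]
scene_masks = np.array([[True, True, False, False],    # subject 0: tokens 0, 1
                        [False, False, True, False]])  # subject 1: token 2
ref_masks = [np.array([True, True, False]),
             np.array([False, True, True])]

out, weights = masked_mutual_self_attention(
    q, k_scene, v_scene, k_refs, v_refs, scene_masks, ref_masks)
```

In this toy setup, scene token 3 belongs to no subject and therefore places zero weight on both portraits, while tokens 0 and 1 can read only from the foreground of the first portrait. Under the stated assumptions, that restriction is what would keep one subject's appearance from leaking into another.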

Paper Structure (38 sections, 5 equations, 18 figures, 6 tables)

Figures (18)

  • Figure 1: Illustration of our proposed DreamStory framework. This system takes a full narrative text as input, generates vivid visual content, and maintains the consistency of multiple subjects across various scenes within the story. Please visit the project https://dream-xyz.github.io/dreamstory to watch the video.
  • Figure 2: The framework of our proposed DreamStory. Initially, the LLM comprehends a story and generates detailed prompts for key subjects and scenes. These prompts are aligned and rewritten so that the diffusion model can interpret them more easily, ensuring accurate visual content generation. Subject portraits are then generated from these prompts and serve as multimodal anchors that maintain multi-subject consistency and enrich scenes with high-quality visual details, which facilitates subsequent video creation using an image-to-video model.
  • Figure 3: Illustration of our Multi-Subject consistent Diffusion model (MSD), along with its Masked Mutual Self-Attention (MMSA) and Masked Mutual Cross-Attention (MMCA) mechanisms. The figure uses two subjects as an example, but the method extends to any number of subjects. Query, Key, and Value projections in the attention layer have been omitted for ease of presentation.
  • Figure 4: Qualitative comparisons of our DreamStory with SOTA approaches on the FSS real story benchmark. Ours, MuDI, and ConsiStory utilize the subject image on the bottom-left as the reference image. In contrast, StoryDiffusion references the subject image on the bottom-right. Different subjects are indicated with different colors. Please visit the project https://dream-xyz.github.io/dreamstory to watch the video.
  • Figure 5: Qualitative comparisons of our DreamStory with SOTA approaches on the ChatGPT-generated story benchmark. Ours, MuDI, and ConsiStory utilize the subject image on the bottom-left as the reference image. In contrast, StoryDiffusion references the subject image on the bottom-right. Different subjects are indicated with different colors. Please visit the project https://dream-xyz.github.io/dreamstory to watch the video.
  • ...and 13 more figures