TaleCrafter: Interactive Story Visualization with Multiple Characters

Yuan Gong; Youxin Pang; Xiaodong Cun; Menghan Xia; Yingqing He; Haoxin Chen; Longyue Wang; Yong Zhang; Xintao Wang; Ying Shan; Yujiu Yang

TL;DR

This work presents TaleCrafter, a four-component pipeline for generic interactive story visualization that supports multiple novel characters and editable layouts. It combines S2P (GPT-4–driven prompts), T2L (discrete-diffusion layout generation), C-T2I (controllable diffusion-based image synthesis with LoRA-based identity preservation and local-structure control), and I2V (depth-aware video animation) to ensure identity consistency, text–image alignment, and flexible object layouts across frames. The approach addresses generalization to new characters and scenes while enabling interactive editing and multi-modal guidance, outperforming baselines in both qualitative and quantitative evaluations and validated by a user study. The system enables zero-shot storytelling with novel characters and scenes and provides practical tools for editing layouts and local structures, expanding the scope of machine-assisted visual storytelling in education and entertainment.

Abstract

Accurate story visualization requires several elements: identity consistency across frames, alignment between plain text and visual content, and a reasonable layout of objects in images. Most previous works try to meet these requirements by fitting a text-to-image (T2I) model on a set of videos in the same style and with the same characters, e.g., the FlintstonesSV dataset. However, the learned T2I models typically struggle to adapt to new characters, scenes, and styles, and often lack the flexibility to revise the layout of the synthesized images. This paper proposes a system for generic interactive story visualization, capable of handling multiple novel characters and supporting the editing of layout and local structure. It is developed by leveraging the prior knowledge of large language and T2I models trained on massive corpora. The system comprises four interconnected components: story-to-prompt generation (S2P), text-to-layout generation (T2L), controllable text-to-image generation (C-T2I), and image-to-video animation (I2V). First, the S2P module converts concise story information into the detailed prompts required for subsequent stages. Next, T2L generates diverse and reasonable layouts based on the prompts, and lets users adjust and refine the layout to their preference. The core component, C-T2I, creates images guided by layouts, sketches, and actor-specific identifiers to maintain consistency and detail across visualizations. Finally, I2V enriches the visualization by animating the generated images. Extensive experiments and a user study validate the effectiveness of the proposed system and the flexibility of its interactive editing.
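The data flow between the four components can be illustrated with a minimal sketch. Every name below (`Frame`, `story_to_prompt`, `text_to_layout`, `controllable_t2i`, `image_to_video`, `visualize`) and the box format are hypothetical placeholders, not the authors' code: each stage is a trivial stand-in for the model the abstract describes, and only the ordering of stages and the user-editable layout step reflect the paper.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

# Normalized (x0, y0, x1, y1); this box format is an assumption.
Box = Tuple[float, float, float, float]


@dataclass
class Frame:
    prompt: str                                           # S2P output
    layout: Dict[str, Box] = field(default_factory=dict)  # T2L output, user-editable
    image: str = ""                                       # stand-in for the C-T2I image
    video: str = ""                                       # stand-in for the I2V clip


def story_to_prompt(story: str) -> List[str]:
    """S2P stand-in: one prompt per sentence (the paper uses a large language model)."""
    return [s.strip() for s in story.split(".") if s.strip()]


def text_to_layout(prompt: str, characters: List[str]) -> Dict[str, Box]:
    """T2L stand-in: give each character named in the prompt an equal-width column."""
    present = [c for c in characters if c in prompt]
    n = len(present)
    return {c: (i / n, 0.25, (i + 1) / n, 1.0) for i, c in enumerate(present)}


def controllable_t2i(frame: Frame) -> str:
    """C-T2I stand-in: record which prompt and boxes would condition the image."""
    tags = ", ".join(f"{name}@{box}" for name, box in frame.layout.items())
    return f"image[{frame.prompt} | {tags}]"


def image_to_video(image: str) -> str:
    """I2V stand-in: wrap the image placeholder."""
    return f"video[{image}]"


def visualize(
    story: str,
    characters: List[str],
    layout_edits: Optional[Dict[int, Dict[str, Box]]] = None,
) -> List[Frame]:
    frames = []
    for i, prompt in enumerate(story_to_prompt(story)):
        frame = Frame(prompt=prompt, layout=text_to_layout(prompt, characters))
        # Interactive step: the user may override any box before rendering.
        frame.layout.update((layout_edits or {}).get(i, {}))
        frame.image = controllable_t2i(frame)
        frame.video = image_to_video(frame.image)
        frames.append(frame)
    return frames
```

Calling `visualize("Tom meets Kitty. Kitty sleeps.", ["Tom", "Kitty"], {1: {"Kitty": (0.0, 0.5, 0.5, 1.0)}})` yields two frames, with the second frame's layout replaced by the user-supplied box before the image stage runs. This is the point of separating T2L from C-T2I: the layout stays editable before any image is generated.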

Paper Structure (28 sections, 6 equations, 6 figures, 2 tables)

Introduction
Related Work
Story Visualization
Text-to-image Generation
Method
Story-to-prompt Generation
Text-to-layout Generation
Controllable Text-to-image Generation
Identity Preservation.
Object Localization.
Local Structure Control.
Iterative Generation
Training Objective
Image-to-video Generation
Experiments
...and 13 more sections

Figures (6)

Figure 1: The pipeline of our interactive story visualization system. The system comprises four components. (a) Story-to-prompt (S2P): a large language model is utilized to bridge the gap between the literary and artistic descriptions and the descriptions fed into T2I models. It comprehends the content in the given story and converts it into prompts suitable for T2I models, following the given instructions. (b) Text-to-layout (T2L): generates a reasonable layout for the main subjects in the prompt. (c) Controllable text-to-image (C-T2I): given various conditions such as prompt, layout, sketch, and a few images of each character, generates consistent-character images. It enables interactive editing of character, layout, and local structure through sketches. (d) Image-to-video (I2V): extracts depth from the image and converts it into a video by setting the camera path for novel view synthesis.
Figure 2: The structure of the C-T2I component. It takes a noisy image as input and generates an image through iterative denoising, conditioning on multiple types of guidance, including prompt, sketch, and bounding box with description. For identity consistency, we use LoRA to learn the personalized weights in self- and cross-attention layers as well as a specific token for each character.
Figure 3: Comparison with Custom-Diffusion and Paint-by-Example. One character is an anime character and the other is a real cat. The style is specified by "Ghibli" for all three methods.
Figure 4: Comparisons with Make-a-Story, Custom-Diffusion, and Paint-by-Example using characters from the FlintstonesSV dataset.
Figure 5: Comparison with GLIGEN on identity control.
...and 1 more figures
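
The Figure 2 caption mentions LoRA for learning personalized attention weights. For reference, the general low-rank update that LoRA applies to a frozen weight matrix is shown below. The symbols ($W_0$, $A$, $B$, $r$, $\alpha$) follow the common LoRA formulation and are an assumption here; the paper's own notation and scaling may differ.

```latex
% Standard LoRA update for one frozen weight matrix W_0
% (notation assumed from the general technique, not taken from this paper)
W = W_0 + \Delta W = W_0 + \frac{\alpha}{r}\, B A,
\qquad B \in \mathbb{R}^{d \times r},\quad A \in \mathbb{R}^{r \times k},\quad r \ll \min(d, k)
```

Only $A$ and $B$ are trained, so each character adds a small number of parameters on top of the shared T2I model.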
