Zero-shot Generation of Coherent Storybook from Plain Text Story using Diffusion Models

Hyeonho Jeong, Gihyun Kwon, Jong Chul Ye

TL;DR

The paper tackles generating coherent, multi-image storybooks from plain text without training data by combining an LLM-based prompt generator with a latent diffusion model and an iterative identity-injection module. It introduces a Textual Inversion–driven identity embedding and a cycle-based mechanism to consistently render the main character across scenes while preserving backgrounds. Quantitative and qualitative evaluations show superior coherency and background fidelity compared with semantic image editing baselines, supported by user studies. The work enables training-free, controllable storybook generation with practical implications for automated storytelling and content creation.

Abstract

Recent advancements in large-scale text-to-image models have opened new possibilities for guiding the creation of images through human-devised natural language. However, while prior literature has primarily focused on the generation of individual images, it is essential to consider the capability of these models to ensure coherency within a sequence of images to fulfill the demands of real-world applications such as storytelling. To address this, we present a novel neural pipeline for generating a coherent storybook from the plain text of a story. Specifically, we leverage a combination of a pre-trained Large Language Model and a text-guided Latent Diffusion Model to generate coherent images. While previous story synthesis frameworks typically require a large-scale text-to-image model trained on expensive image-caption pairs to maintain coherency, we employ simple textual inversion techniques along with detector-based semantic image editing, which allows zero-shot generation of the coherent storybook. Experimental results show that our proposed method outperforms state-of-the-art image editing baselines.

Paper Structure

This paper contains 21 sections, 8 equations, 12 figures, 1 table, 1 algorithm.

Figures (12)

  • Figure 1: Zero-shot generation example of a coherent storybook with the main character, Brad (top) and Hanna (bottom), using the plain text of the story 'The Lazy John'. All the generation processes are performed using large language models (LLM) and latent diffusion models without additional training data.
  • Figure 2: (a) Overall prompt generation process. (b) Text-to-Image generation results on the corresponding text sets: it can be observed that the generated images in the lower rows more effectively depict the semantics of the corresponding texts.
  • Figure 3: Iterative Coherent Identity Injection procedure.
  • Figure 4: Text-to-image generation with different storybook style modifiers.
  • Figure 5: Comparison with semantic image editing baselines. Our method effectively maintains the identity of a character across multiple images while preserving the background of the source images. Note that our method consistently preserves the emotional expressions from the source images, unlike other methods.
  • ...and 7 more figures