Generating Storytelling Images with Rich Chains-of-Reasoning

Xiujie Song, Qi Jia, Shota Watanabe, Xiaoyi Pang, Ruijie Chen, Mengyue Wu, Kenny Q. Zhu

TL;DR

The paper formalizes Storytelling Image Generation: producing a single image that encodes a semantically rich story through Chains-of-Reasoning (CoRs). It introduces StorytellingPainter, a two-stage pipeline in which an LLM Storyteller writes a concise story and a T2I Painter renders it as an image, with Naive and CoR-Guided prompting modes. Generated images are assessed with three evaluators: Semantic Complexity, KNN-Based Diversity, and Story-Image Alignment. Mini-Storyteller models (trained with SFT and DPO) show how lightweight open-source LLMs can approach proprietary-model performance. Experimental results show that CoR-Guided prompts and higher-quality Painters improve semantic depth and alignment, while Mini-Storyteller training narrows the gap between open-source and proprietary Storytellers, supporting applications in cognitive assessment, illustration, and multimodal reasoning research.

Abstract

An image can convey a compelling story by presenting rich, logically connected visual clues. These connections form Chains-of-Reasoning (CoRs) within the image, enabling viewers to infer events, causal relationships, and other information, thereby understanding the underlying story. In this paper, we focus on these semantically rich images and define them as Storytelling Images. Such images have diverse applications beyond illustration creation and cognitive screening, leveraging their ability to convey multi-layered information visually and inspire active interpretation. However, due to their complex semantic nature, Storytelling Images are inherently challenging to create, and thus remain relatively scarce. To address this challenge, we introduce the Storytelling Image Generation task, which explores how generative AI models can be leveraged to create such images. Specifically, we propose a two-stage pipeline, StorytellingPainter, which combines the creative reasoning abilities of Large Language Models (LLMs) with the visual synthesis capabilities of Text-to-Image (T2I) models to generate Storytelling Images. Alongside this pipeline, we develop a dedicated evaluation framework comprising three main evaluators: a Semantic Complexity Evaluator, a KNN-based Diversity Evaluator and a Story-Image Alignment Evaluator. Given the critical role of story generation in the Storytelling Image Generation task and the performance disparity between open-source and proprietary LLMs, we further explore tailored training strategies to reduce this gap, resulting in a series of lightweight yet effective models named Mini-Storytellers. Experimental results demonstrate the feasibility and effectiveness of our approaches. The code is available at https://github.com/xiujiesong/StorytellingImageGeneration.
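
The two-stage structure described above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not the authors' implementation: the prompt templates, the `StorytellingImage` container, and the function names are hypothetical, and the Storyteller and Painter are passed in as plain callables so that any LLM or T2I backend could be plugged in.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical prompt templates (assumptions for illustration only);
# the paper's actual Naive and CoR-Guided prompts are not reproduced here.
NAIVE_PROMPT = "Write a concise story that a single image could depict. Topic: {topic}"
COR_GUIDED_PROMPT = (
    "Write a concise story that a single image could depict. "
    "Include logically connected visual clues that let a viewer infer "
    "events and causal relationships. Topic: {topic}"
)


@dataclass
class StorytellingImage:
    """Hypothetical container pairing the generated story with its image."""
    story: str
    image: bytes


def build_storyteller_prompt(topic: str, cor_guided: bool) -> str:
    """Select the Naive or CoR-Guided template and fill in the topic."""
    template = COR_GUIDED_PROMPT if cor_guided else NAIVE_PROMPT
    return template.format(topic=topic)


def storytelling_painter(
    topic: str,
    storyteller: Callable[[str], str],   # stage 1: prompt -> story (an LLM)
    painter: Callable[[str], bytes],     # stage 2: story -> image (a T2I model)
    cor_guided: bool = True,
) -> StorytellingImage:
    """Run the two stages in sequence: write the story, then render it."""
    prompt = build_storyteller_prompt(topic, cor_guided)
    story = storyteller(prompt)
    image = painter(story)
    return StorytellingImage(story=story, image=image)
```

Keeping the two stages as interchangeable callables mirrors the comparison the paper reports, where different Storyteller models are evaluated with a fixed Painter.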

Paper Structure

This paper contains 34 sections, 4 equations, 9 figures, 3 tables.

Figures (9)

  • Figure 1: Illustrative cases of Storytelling Images with rich CoRs. Each case consists of the original image, a graph formed by the connected CoRs within the image, and a textual description. (a) Cookie Theft, an image from the Boston Diagnostic Aphasia Examination that is widely used in cognitive and linguistic assessments. (b) Frustration, a magazine cover illustration by Arthur Saron Sarnoff.
  • Figure 2: Our proposed StorytellingPainter pipeline and dedicated evaluators.
  • Figure 3: Prompt for the summarizer in KNN-based Diversity Evaluator.
  • Figure 4: Prompt for the Alignment Scoring stage in Story-Image Alignment Evaluator.
  • Figure 5: Distribution of visual clues across the seven dimensions in images generated by the StorytellingPainter pipeline with different Storyteller models. The Painter model is fixed as GPT-Image-1.
  • ...and 4 more figures