ViStoryBench: Comprehensive Benchmark Suite for Story Visualization

Cailin Zhuang, Ailin Huang, Yaoqi Hu, Jingwei Wu, Wei Cheng, Jiaqi Liao, Hongyuan Wang, Xinyao Liao, Weiwei Cai, Hengyuan Xu, Xuanyang Zhang, Xianfang Zeng, Zhewei Huang, Gang Yu, Chi Zhang

TL;DR

ViStoryBench addresses the need for a comprehensive, multi-shot benchmark for story visualization by compiling 80 diverse stories with character references and script annotations, and by introducing 12 automated metrics (including CIDS, Style Similarity, Prompt Alignment, OCCM, and Copy-Paste Detection) validated through human studies. It combines LLM-assisted data construction with manual verification and offers ViStoryBench-Lite for cost-effective evaluation. The study benchmarks over 30 methods (including open-source and commercial tools) and reveals a trade-off between narrative alignment and visual fidelity, highlighting strengths of multi-modal LLMs in semantic coherence and the superior aesthetics of commercial tools, while video-based approaches face temporal-consistency challenges. The results motivate hybrid approaches that fuse planning capabilities of LLMs with diffusion-based visual quality and advocate for richer, background-aware and temporal evaluation in future benchmarks.

Abstract

Story visualization aims to generate coherent image sequences that faithfully depict a narrative and align with character references. Despite progress in generative models, existing benchmarks are narrow in scope, often limited to short prompts, lacking character references, or single-image cases, and fail to capture real-world storytelling complexity. This hinders a nuanced understanding of model capabilities and limitations. We present ViStoryBench, a comprehensive benchmark designed to evaluate story visualization models across diverse narrative structures, visual styles, and character settings. The benchmark features richly annotated multi-shot scripts derived from curated stories spanning literature, film, and folklore. Large language models assist in story summarization and script generation, with all outputs human-verified to ensure coherence and fidelity. Character references are carefully curated to maintain intra-story consistency across varying artistic styles. To enable thorough evaluation, ViStoryBench introduces a set of automated metrics that assess character consistency, style similarity, prompt alignment, aesthetic quality, and generation artifacts such as copy-paste behavior. These metrics are validated through human studies, and used to benchmark a broad range of open-source and commercial models. ViStoryBench offers a multi-dimensional evaluation suite that facilitates systematic analysis and fosters future progress in visual storytelling.

Paper Structure

This paper contains 109 sections, 4 equations, 15 figures, 12 tables.

Figures (15)

  • Figure 1: Overview of ViStoryBench. We propose a structured prompt-engineering approach for data construction, which employs 5 strategies to convert an LLM into a controllable visual narrative script generator. Generated descriptions are proofread by humans to ensure reasonableness, and character reference images are collected manually. Based on this dataset, we develop evaluation metrics focusing on character and style similarity, along with multi-grained prompt alignment. A comprehensive evaluation is conducted using a hybrid framework combining expert models and VLMs.
  • Figure 2: ViStoryBench Dataset Statistics Overview. (a) Distribution of story style and cultural origin in our dataset. (b) Theme distribution comparison of the Full set and the Lite subset, showing high statistical similarity. (c) Distribution of the number of reference characters per story. See the appendix sections on the dataset and the Lite subset for details.
  • Figure 3: Case Study of Failure in Story Visualization. The cases cover four aspects: alignment of character interaction, individual actions, camera design, and scene setting descriptions; whether the number of characters in the generated shots matches the script; the tendency to replicate reference images rather than generating novel instances; and style and identity consistency of generated characters with reference images and across shots.
  • Figure 4: Character Identification Similarity (CIDS) Metric. CIDS evaluates both cross-similarity and self-consistency by detecting and cropping character regions from reference and generated images, then computing cosine similarity between matched character features.
  • Figure 5: Prompt Alignment Evaluation. Based on the shot descriptions, the Scene Score, Camera Score, Character Interaction Score, and Individual Action Score are evaluated via GPT-4.1 [GPT4.1] for best evaluation accuracy and Qwen3-VL [yang2025qwen3] for reproducibility.
  • ...and 10 more figures
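
The Figure 4 caption describes CIDS as cosine similarity between matched character features, reported both against the reference (cross-similarity) and among generated shots (self-consistency). The following is a minimal sketch of that final scoring step only, not the authors' implementation. It assumes that character detection, cropping, feature extraction, and matching have already produced one feature vector per crop, and that scores are aggregated by a plain mean. The function names and the averaging choice are hypothetical.

```python
import math
from itertools import combinations


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length feature vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def cross_similarity(reference_feature, generated_features):
    """Mean similarity between a reference feature and each matched generated crop.

    Assumption: a simple mean is used for aggregation.
    """
    if not generated_features:
        return 0.0
    total = sum(cosine_similarity(reference_feature, g) for g in generated_features)
    return total / len(generated_features)


def self_consistency(generated_features):
    """Mean pairwise similarity among generated crops of the same character.

    Assumption: all unordered pairs are weighted equally.
    """
    pairs = list(combinations(generated_features, 2))
    if not pairs:
        return 0.0
    return sum(cosine_similarity(a, b) for a, b in pairs) / len(pairs)


if __name__ == "__main__":
    # Toy 2-D features standing in for real embeddings (hypothetical values).
    reference = [1.0, 0.0]
    generated = [[1.0, 0.0], [0.0, 1.0]]
    print(cross_similarity(reference, generated))
    print(self_consistency(generated))
```

Keeping the two scores separate reflects the caption's distinction: a model can stay consistent across its own shots while drifting from the reference character, or the reverse.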