ViStoryBench: Comprehensive Benchmark Suite for Story Visualization
Cailin Zhuang, Ailin Huang, Yaoqi Hu, Jingwei Wu, Wei Cheng, Jiaqi Liao, Hongyuan Wang, Xinyao Liao, Weiwei Cai, Hengyuan Xu, Xuanyang Zhang, Xianfang Zeng, Zhewei Huang, Gang Yu, Chi Zhang
TL;DR
ViStoryBench addresses the need for a comprehensive, multi-shot benchmark for story visualization. It compiles 80 diverse stories with character references and script annotations, and introduces 12 automated metrics (including CIDS, Style Similarity, Prompt Alignment, OCCM, and Copy-Paste Detection) validated through human studies. It combines LLM-assisted data construction with manual verification and offers ViStoryBench-Lite for cost-effective evaluation. The study benchmarks over 30 methods (including open-source and commercial tools) and reveals a trade-off between narrative alignment and visual fidelity: multi-modal LLMs show strengths in semantic coherence, commercial tools deliver superior aesthetics, and video-based approaches face temporal-consistency challenges. The results motivate hybrid approaches that fuse the planning capabilities of LLMs with diffusion-based visual quality, and they advocate for richer, background-aware, and temporal evaluation in future benchmarks.
Abstract
Story visualization aims to generate coherent image sequences that faithfully depict a narrative and align with character references. Despite progress in generative models, existing benchmarks are narrow in scope: they are often limited to short prompts, lack character references, or cover only single-image cases, and so fail to capture real-world storytelling complexity. This hinders a nuanced understanding of model capabilities and limitations. We present ViStoryBench, a comprehensive benchmark designed to evaluate story visualization models across diverse narrative structures, visual styles, and character settings. The benchmark features richly annotated multi-shot scripts derived from curated stories spanning literature, film, and folklore. Large language models assist in story summarization and script generation, with all outputs human-verified to ensure coherence and fidelity. Character references are carefully curated to maintain intra-story consistency across varying artistic styles. To enable thorough evaluation, ViStoryBench introduces a set of automated metrics that assess character consistency, style similarity, prompt alignment, aesthetic quality, and generation artifacts such as copy-paste behavior. These metrics are validated through human studies and used to benchmark a broad range of open-source and commercial models. ViStoryBench offers a multi-dimensional evaluation suite that facilitates systematic analysis and fosters future progress in visual storytelling.
