WebNovelBench: Placing LLM Novelists on the Web Novel Distribution
Leon Lin, Jun Zheng, Haidong Wang
TL;DR
WebNovelBench is a scalable benchmark for long-form Chinese web-novel generation. It frames evaluation as a synopsis-to-story task and scores outputs with an automated LLM-as-Judge across eight narrative dimensions. Using a corpus of over 4,000 web novels, the framework aggregates the dimension scores with PCA-derived weights and maps them to percentile ranks through an empirical cumulative distribution function (ECDF) built from human-authored works, so that 24 state-of-the-art LLMs can be placed on the same human quality distribution. Experimental results show that the benchmark distinguishes classics, popular web fiction, and model-generated narratives, and that it reveals the relative strengths and weaknesses of individual models. The work offers a practical, extensible, data-driven methodology for advancing LLM-driven narrative generation and for guiding future development of long-form storytelling systems.
Abstract
Robustly evaluating the long-form storytelling capabilities of Large Language Models (LLMs) remains a significant challenge, as existing benchmarks often lack the necessary scale, diversity, or objective measures. To address this, we introduce WebNovelBench, a novel benchmark specifically designed for evaluating long-form novel generation. WebNovelBench leverages a large-scale dataset of over 4,000 Chinese web novels, framing evaluation as a synopsis-to-story generation task. We propose a multi-faceted framework encompassing eight narrative quality dimensions, assessed automatically via an LLM-as-Judge approach. Scores are aggregated using Principal Component Analysis and mapped to a percentile rank against human-authored works. Our experiments demonstrate that WebNovelBench effectively differentiates between human-written masterpieces, popular web novels, and LLM-generated content. We provide a comprehensive analysis of 24 state-of-the-art LLMs, ranking their storytelling abilities and offering insights for future development. This benchmark provides a scalable, replicable, and data-driven methodology for assessing and advancing LLM-driven narrative generation.
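The aggregation step described above (PCA-based weighting followed by a percentile rank against human-authored works) can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes the weights are the normalized absolute loadings of the first principal component, the function names `pca_weights` and `ecdf_percentile` are hypothetical, and the score matrix is synthetic placeholder data rather than benchmark results.

```python
import numpy as np

def pca_weights(scores):
    """Derive one weight per dimension from an (n_works, n_dims) score matrix.

    Assumption: weights are the absolute loadings of the first principal
    component, rescaled to sum to 1. The paper may use a different scheme.
    """
    centered = scores - scores.mean(axis=0)
    cov = np.cov(centered, rowvar=False)
    # eigh returns eigenvalues in ascending order, so the last column
    # of eigvecs is the first principal component.
    _, eigvecs = np.linalg.eigh(cov)
    loadings = np.abs(eigvecs[:, -1])
    return loadings / loadings.sum()

def ecdf_percentile(reference, value):
    """Percentage of reference scores that are <= value (empirical CDF)."""
    ref = np.sort(np.asarray(reference, dtype=float))
    return 100.0 * np.searchsorted(ref, value, side="right") / ref.size

# Synthetic stand-in for judge scores of human-written novels
# (4,000 works x 8 narrative dimensions); NOT real benchmark data.
rng = np.random.default_rng(0)
human_scores = rng.uniform(1.0, 5.0, size=(4000, 8))

weights = pca_weights(human_scores)
human_aggregate = human_scores @ weights

# Hypothetical judge scores for one model-generated story.
model_scores = np.full(8, 3.0)
model_percentile = ecdf_percentile(human_aggregate, model_scores @ weights)
```

Here `model_percentile` reads as "the generated story scores at least as high as this percentage of the human-authored reference works", which is the sense in which a model is placed on the web-novel distribution.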
