WebNovelBench: Placing LLM Novelists on the Web Novel Distribution

Leon Lin, Jun Zheng, Haidong Wang

TL;DR

WebNovelBench is a scalable benchmark for long-form Chinese web-novel generation. It frames evaluation as a synopsis-to-story task and scores outputs with an automated LLM-as-Judge across eight narrative dimensions. A corpus of over 4,000 web novels defines a distribution of human-written quality, built via PCA weighting and ECDF-based percentile ranking, against which 24 state-of-the-art LLMs are placed. Experimental results show that the benchmark distinguishes classics, popular web fiction, and model-generated narratives, and reveals strengths and weaknesses across models. The work offers a practical, extensible methodology for evaluating LLM-driven narrative generation and provides insights to guide future development of long-form storytelling systems.

Abstract

Robustly evaluating the long-form storytelling capabilities of Large Language Models (LLMs) remains a significant challenge, as existing benchmarks often lack the necessary scale, diversity, or objective measures. To address this, we introduce WebNovelBench, a novel benchmark specifically designed for evaluating long-form novel generation. WebNovelBench leverages a large-scale dataset of over 4,000 Chinese web novels, framing evaluation as a synopsis-to-story generation task. We propose a multi-faceted framework encompassing eight narrative quality dimensions, assessed automatically via an LLM-as-Judge approach. Scores are aggregated using Principal Component Analysis and mapped to a percentile rank against human-authored works. Our experiments demonstrate that WebNovelBench effectively differentiates between human-written masterpieces, popular web novels, and LLM-generated content. We provide a comprehensive analysis of 24 state-of-the-art LLMs, ranking their storytelling abilities and offering insights for future development. This benchmark provides a scalable, replicable, and data-driven methodology for assessing and advancing LLM-driven narrative generation.
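The aggregation step described above (per-dimension scores combined with PCA, then mapped to a percentile against human-authored works) can be illustrated with a minimal sketch. The exact weighting scheme is not specified here, so this example assumes the weights are the normalized absolute loadings of the first principal component; the function names `pca_weights` and `percentile_rank` and the synthetic score matrix are hypothetical and purely illustrative.

```python
import numpy as np

def pca_weights(scores):
    """Derive one weight per dimension from the first principal component.

    Assumption: weights are the absolute loadings of the first PC,
    normalized to sum to 1. The benchmark's actual scheme may differ.
    """
    centered = scores - scores.mean(axis=0)
    cov = np.cov(centered, rowvar=False)
    # eigh returns eigenvalues in ascending order, so the last
    # eigenvector corresponds to the largest eigenvalue.
    _, eigvecs = np.linalg.eigh(cov)
    first_pc = np.abs(eigvecs[:, -1])
    return first_pc / first_pc.sum()

def percentile_rank(reference, value):
    """ECDF percentile: share of reference values <= value, in percent."""
    ref = np.sort(reference)
    return 100.0 * np.searchsorted(ref, value, side="right") / ref.size

# Synthetic stand-in for judge scores: 4,000 novels x 8 dimensions on a 1-5 scale.
rng = np.random.default_rng(0)
human_scores = rng.uniform(1, 5, size=(4000, 8))

weights = pca_weights(human_scores)
human_agg = human_scores @ weights          # one aggregate score per novel

# A hypothetical model-generated story scoring 4.0 on every dimension.
model_story = np.full(8, 4.0)
pct = percentile_rank(human_agg, model_story @ weights)
print(f"percentile rank against the human corpus: {pct:.1f}")
```

Because the percentile is computed from the empirical distribution rather than a fitted curve, it makes no normality assumption about the human-authored scores.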

Paper Structure

This paper contains 24 sections, 4 equations, 10 figures, 2 tables.

Figures (10)

  • Figure 1: Web Novel Dataset Distribution and LLM Placement. Our web novel dataset's quality distribution, with Low, Medium, and High zones (95% CIs). The red curve (classic literary works) validates the high-quality zone. Positions of 24 LLMs indicate their performance relative to this corpus.
  • Figure 2: Framework of Our Method. Our benchmark framework consists of four major components: (1) Data Preparation Phase: We collect and curate a large web novel dataset, and use Doubao for story-to-synopsis extraction to build a 4,000-novel synopsis-to-story dataset. (2) Distribution Construction: Each story is scored across eight quality dimensions using LLM-as-judge, followed by PCA+ECDF to form a quality distribution benchmark. Classic literary works are used to validate the high end of the distribution. (3) Model Evaluation: LLMs generate stories from selected subsets of the dataset. Their outputs are scored and mapped onto the distribution to assess model performance. (4) Ad Hoc Evaluation: New data can be scored and aligned with the benchmark for measuring data quality and supporting further applications.
  • Figure 3: LLM Performance Heatmap Across Narrative Dimensions. Shows average scores (1-5 scale) for 24 LLMs on eight dimensions, sorted by percentile rank. Final column is PCA-derived weighted norm score. Higher scores indicate better alignment with quality human writing.
  • Figure 4: Distributions of Narrative Metrics and Fitted Normal Curves. Each subplot shows the empirical distribution (solid line) of a narrative evaluation dimension across the web novel dataset, alongside the corresponding fitted normal distribution (dashed line). The comparison illustrates the varying shapes of real-world data and highlights where distributions deviate from normality. The bottom-right panel presents the overall distribution of mean scores.
  • Figure 5: Robustness Assessment of LLM-as-Judge. Boxplot of normalized scores for selected classic Chinese novels, based on 11 repeated evaluations using the LLM-as-judge framework. Each box shows the interquartile range (IQR) with the median (solid line) and mean (dashed line) marked. The majority of works demonstrate consistently high scores with narrow IQRs and minimal outliers, indicating the robustness and stability of model evaluations.
  • ...and 5 more figures