DramaBench: A Six-Dimensional Evaluation Framework for Drama Script Continuation

Shijian Ma, Yunqi Huang, Yan Lin

TL;DR

DramaBench introduces a six-dimensional framework for evaluating drama script continuation, addressing gaps in format fidelity, narrative progression, character voice, emotional depth, logical coherence, and conflict handling. It uses a novel LLM labeling plus statistical analysis pipeline to produce reproducible, interpretable metrics rather than opaque scores. The benchmark comprises 1,103 professionally structured scripts with 8,824 evaluations across 8 state-of-the-art models and rigorous significance testing, validation, and ablation studies. The findings show no single model excels across all dimensions, underscoring the value of multi-dimensional evaluation for targeted improvement and the reliability of the approach.

Abstract

Drama script continuation requires models to maintain character consistency, advance plot coherently, and preserve dramatic structure, capabilities that existing benchmarks fail to evaluate comprehensively. We present DramaBench, the first large-scale benchmark for evaluating drama script continuation across six independent dimensions: Format Standards, Narrative Efficiency, Character Consistency, Emotional Depth, Logic Consistency, and Conflict Handling. Our framework combines rule-based analysis with LLM-based labeling and statistical metrics, ensuring objective and reproducible evaluation. We conduct a comprehensive evaluation of 8 state-of-the-art language models on 1,103 scripts (8,824 evaluations total), with rigorous statistical significance testing (252 pairwise comparisons, 65.9% significant) and human validation (188 scripts, substantial agreement on 3/5 dimensions). Our ablation studies confirm all six dimensions capture independent quality aspects (mean $|r| = 0.020$). DramaBench provides actionable, dimension-specific feedback for model improvement and establishes a rigorous standard for creative writing evaluation.
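The counts quoted in the abstract can be cross-checked with a few lines of arithmetic. The sketch below is only a consistency check on the reported figures. The variable names are illustrative. The reading of 252 as "model pairs times a fixed number of comparisons per pair" is an assumption, since the abstract does not say how the 252 comparisons are broken down.

```python
from math import comb

# Figures quoted in the abstract.
n_models = 8
n_scripts = 1103
reported_comparisons = 252
reported_significant_pct = 65.9

# One continuation per (model, script) pair.
total_evaluations = n_models * n_scripts

# Unordered model pairs available for pairwise testing.
model_pairs = comb(n_models, 2)

# ASSUMPTION: every model pair is compared the same number of times
# (for example, once per metric). The abstract does not state this.
comparisons_per_pair = reported_comparisons / model_pairs

# Integer counts of significant comparisons that round to the reported percentage.
candidates = [
    k for k in range(reported_comparisons + 1)
    if round(100 * k / reported_comparisons, 1) == reported_significant_pct
]

print(total_evaluations, model_pairs, comparisons_per_pair, candidates)
```

Under these assumptions the evaluation total matches the reported 8,824, and the reported percentage corresponds to a single whole number of significant comparisons.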

Paper Structure

This paper contains 51 sections, 4 figures, 9 tables.

Figures (4)

  • Figure 1: Overview of the DramaBench evaluation framework. The pipeline consists of three components: (1) Task input with script context and model continuation, (2) Six independent evaluation dimensions (Format, Narrative, Character, Emotion, Logic, Conflict), and (3) Structured LLM labeling framework that extracts categorical labels (not direct scores) which are then aggregated into objective metrics. This approach ensures reproducibility and provides actionable feedback for model improvement.
  • Figure 2: Model performance comparison across six dimensions. All 8 SOTA models displayed on a single radar chart, revealing distinct capability profiles: GPT-5.2 shows balanced excellence across all dimensions, Qwen3-Max specializes in Emotional Depth, while Gemini 3 Pro excels in Conflict Handling. No single model dominates all dimensions.
  • Figure 3: Top 10 error types from the error taxonomy (10,850 total errors). Dialogue-Action Imbalance and Low Information Gain are the most common.
  • Figure 4: Spearman correlation matrix ($5 \times 5$) between content dimensions. Near-zero correlations (mean $|r| = 0.014$) confirm that each dimension captures independent quality aspects. Format Standards excluded due to 100% compliance across all models.
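To make the independence statistic in Figure 4 concrete, here is a minimal sketch of how a mean absolute Spearman correlation over dimension pairs could be computed. It is not the authors' code. It assumes the mean is taken over the off-diagonal pairs of the matrix, and it uses synthetic random scores rather than DramaBench data, so the printed value is illustrative only. All function and variable names are hypothetical.

```python
import random
from itertools import combinations

def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5

def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on ranks."""
    return pearson(average_ranks(x), average_ranks(y))

def mean_abs_offdiag(columns):
    """Mean |r| over all unordered pairs of columns (off-diagonal entries)."""
    pairs = list(combinations(columns, 2))
    return sum(abs(spearman(columns[a], columns[b])) for a, b in pairs) / len(pairs)

# SYNTHETIC data: independent random scores for the five content dimensions.
random.seed(0)
dims = ["narrative", "character", "emotion", "logic", "conflict"]
scores = {d: [random.random() for _ in range(1000)] for d in dims}

print(f"mean |r| over {len(dims) * (len(dims) - 1) // 2} pairs: "
      f"{mean_abs_offdiag(scores):.3f}")
```

With five dimensions there are ten unordered pairs, so the statistic is an average over ten correlation magnitudes. Independent scores give a value close to zero, which is the pattern the caption describes.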