DramaBench: A Six-Dimensional Evaluation Framework for Drama Script Continuation
Shijian Ma, Yunqi Huang, Yan Lin
TL;DR
DramaBench introduces a six-dimensional framework for evaluating drama script continuation, addressing gaps in format fidelity, narrative progression, character voice, emotional depth, logical coherence, and conflict handling. It combines LLM-based labeling with statistical analysis in a novel pipeline that produces reproducible, interpretable metrics rather than opaque scores. The benchmark comprises 1,103 professionally structured scripts and 8,824 evaluations across 8 state-of-the-art models, supported by rigorous significance testing, human validation, and ablation studies. The findings show that no single model excels across all dimensions, which underscores the value of multi-dimensional evaluation for targeted improvement and supports the reliability of the approach.
Abstract
Drama script continuation requires models to maintain character consistency, advance plot coherently, and preserve dramatic structure, capabilities that existing benchmarks fail to evaluate comprehensively. We present DramaBench, the first large-scale benchmark for evaluating drama script continuation across six independent dimensions: Format Standards, Narrative Efficiency, Character Consistency, Emotional Depth, Logic Consistency, and Conflict Handling. Our framework combines rule-based analysis with LLM-based labeling and statistical metrics, ensuring objective and reproducible evaluation. We conduct a comprehensive evaluation of 8 state-of-the-art language models on 1,103 scripts (8,824 evaluations total), with rigorous statistical significance testing (252 pairwise comparisons, 65.9% significant) and human validation (188 scripts, substantial agreement on 3/5 dimensions). Our ablation studies confirm that all six dimensions capture independent quality aspects (mean |r| = 0.020). DramaBench provides actionable, dimension-specific feedback for model improvement and establishes a rigorous standard for creative writing evaluation.
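
As a rough illustration of the independence statistic quoted above, the sketch below computes a mean absolute pairwise Pearson correlation across score columns. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names `pearson` and `mean_abs_correlation`, the dictionary layout, and the toy scores are all hypothetical, and the abstract does not specify which correlation coefficient or which per-dimension metrics were used.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def mean_abs_correlation(scores):
    """Average |r| over every unordered pair of score columns.

    `scores` maps a dimension name to its per-script scores
    (hypothetical layout, assumed for illustration only).
    """
    pairs = list(combinations(scores, 2))
    total = sum(abs(pearson(scores[a], scores[b])) for a, b in pairs)
    return total / len(pairs)


# Toy data, invented for illustration; these are not benchmark results.
toy_scores = {
    "Format Standards": [1, 2, 3, 4],
    "Narrative Efficiency": [2, 4, 6, 8],
    "Character Consistency": [1, -1, -1, 1],
}
toy_mean_abs_r = mean_abs_correlation(toy_scores)

# Counts implied by the abstract's setup.
dimension_pairs = len(list(combinations(range(6), 2)))  # pairs of dimensions
model_pairs = len(list(combinations(range(8), 2)))      # pairs of models
total_evaluations = 1103 * 8                            # scripts x models
```

With six dimensions there are 15 dimension pairs to average over, and 1,103 scripts scored for 8 models gives the 8,824 evaluations reported. A value of mean |r| near zero, such as the reported 0.020, indicates that the dimensions are close to linearly uncorrelated.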
