CML-Bench: A Framework for Evaluating and Enhancing LLM-Powered Movie Scripts Generation

Mingzhe Zheng, Dingjie Song, Guanyu Zhou, Jun You, Jiahao Zhan, Xuran Ma, Xinyuan Song, Ser-Nam Lim, Qifeng Chen, Harry Yang

TL;DR

The paper tackles the gap between LLM-generated movie scripts and the nuanced storytelling demanded by cinema. It introduces CML-Dataset to ground evaluation in authentic, high-quality script segments and identifies three dimensions of screenplay quality: Dialogue Coherence, Character Consistency, and Plot Reasonableness. Building on these, it presents CML-Bench, a nine-metric, cinema-focused evaluation framework, and CML-Instruction, a structured prompting strategy that improves the quality of LLM-generated scripts, with results that align closely with human judgments. Experiments, case studies, and human evaluations show improvements across genres and establish a practical, interpretable pipeline for automated screenplay generation and assessment.

Abstract

Large Language Models (LLMs) have demonstrated remarkable proficiency in generating highly structured texts. However, while exhibiting a high degree of structural organization, movie scripts demand an additional layer of nuanced storytelling and emotional depth, the 'soul' of compelling cinema, that LLMs often fail to capture. To investigate this deficiency, we first curated CML-Dataset, a dataset comprising (summary, content) pairs for Cinematic Markup Language (CML), where 'content' consists of segments from esteemed, high-quality movie scripts and 'summary' is a concise description of the content. Through an in-depth analysis of the intrinsic multi-shot continuity and narrative structures within these authentic scripts, we identified three pivotal dimensions for quality assessment: Dialogue Coherence (DC), Character Consistency (CC), and Plot Reasonableness (PR). Informed by these findings, we propose CML-Bench, featuring quantitative metrics across these dimensions. CML-Bench effectively assigns high scores to well-crafted, human-written scripts while concurrently pinpointing the weaknesses in screenplays generated by LLMs. To further validate our benchmark, we introduce CML-Instruction, a prompting strategy with detailed instructions on character dialogue and event logic, to guide LLMs to generate more structured and cinematically sound scripts. Extensive experiments validate the effectiveness of our benchmark and demonstrate that LLMs guided by CML-Instruction generate higher-quality screenplays, with results aligned with human preferences.
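
As a rough illustration of the structures the abstract describes, the sketch below models a dataset entry as a (summary, content) pair and groups per-metric scores under the three dimensions (DC, CC, PR). The field names `title` and `genre`, the metric names, and the use of a plain average per dimension are all assumptions made for this example; the abstract does not specify the dataset schema or how CML-Bench aggregates its metrics.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Dict, List, Tuple

# The three quality dimensions named in the abstract.
DIMENSIONS = ("DC", "CC", "PR")


@dataclass
class CMLEntry:
    """One (summary, content) pair; `title` and `genre` are hypothetical fields."""
    title: str
    genre: str
    content: str   # segment of a human-written movie script
    summary: str   # concise description of the segment


def dimension_scores(metric_scores: Dict[Tuple[str, str], float]) -> Dict[str, float]:
    """Average metric scores within each dimension.

    Keys are (dimension, metric_name) pairs. Plain averaging is an
    assumption of this sketch, not the paper's stated aggregation.
    """
    grouped: Dict[str, List[float]] = {d: [] for d in DIMENSIONS}
    for (dimension, _metric_name), score in metric_scores.items():
        grouped[dimension].append(score)
    # Skip dimensions that received no metric scores.
    return {d: mean(values) for d, values in grouped.items() if values}


entry = CMLEntry(
    title="Example Film",
    genre="Drama",
    content="INT. KITCHEN - NIGHT ...",
    summary="Two siblings argue over selling the family home.",
)
scores = dimension_scores({
    ("DC", "metric_a"): 0.5,
    ("DC", "metric_b"): 1.0,
    ("CC", "metric_a"): 0.25,
    ("PR", "metric_a"): 1.0,
})
```

Keying scores by (dimension, metric) keeps the grouping explicit, so the sketch does not need to assume how the nine metrics are divided among the three dimensions.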

Paper Structure

This paper contains 53 sections, 9 equations, 9 figures, 3 tables.

Figures (9)

  • Figure 1: CML-Dataset construction pipeline: Raw movie script elements (e.g., stage direction, scene description, character, dialogue) are extracted and transformed into a structured CML-Dataset entry containing basic information, the script segment (content), and an LLM-generated summary.
  • Figure 2: Genre distribution in CML-Dataset. The dataset includes all major movie types.
  • Figure 3: Intrinsic Characteristics Analysis.
  • Figure 4: CML-Bench evaluation of LLM-generated screenplays. (a) Human (GT) vs. LLMs with base prompt settings. (b) Human (GT) vs. LLMs with CML-Instruction.
  • Figure 5: Case studies on diverse movie genres.
  • ...and 4 more figures