OpenMEVA: A Benchmark for Evaluating Open-ended Story Generation Metrics

Jian Guan, Zhexin Zhang, Zhuoer Feng, Zitao Liu, Wenbiao Ding, Xiaoxi Mao, Changjie Fan, Minlie Huang

TL;DR

OpenMEVA targets a key gap in evaluating open-ended story generation by providing a benchmark that jointly assesses correlation with human judgments, generalization across models and datasets, discourse-aware coherence, and robustness to perturbations. It combines a manually annotated set (mans), created from ROCStories and WritingPrompts, with an auto-constructed set (autos) that isolates specific aspects via perturbations, and it is complemented by an open-source toolkit for metric evaluation and data augmentation. The study shows that existing metrics still align weakly with human judgments (correlation often below $0.5$ on mans), fail to capture discourse-level incoherence and inferential knowledge, and struggle to generalize or withstand perturbations, underscoring the need for improved evaluation methods. By standardizing test cases and providing perturbation-based checks, OpenMEVA enables fairer metric comparisons and accelerates progress in developing robust evaluation methods for long-form narrative generation.

Abstract

Automatic metrics are essential for developing natural language generation (NLG) models, particularly for open-ended language generation tasks such as story generation. However, existing automatic metrics are observed to correlate poorly with human evaluation. The lack of standardized benchmark datasets makes it difficult to fully evaluate the capabilities of a metric and fairly compare different metrics. Therefore, we propose OpenMEVA, a benchmark for evaluating open-ended story generation metrics. OpenMEVA provides a comprehensive test suite to assess the capabilities of metrics, including (a) the correlation with human judgments, (b) the generalization to different model outputs and datasets, (c) the ability to judge story coherence, and (d) the robustness to perturbations. To this end, OpenMEVA includes both manually annotated stories and auto-constructed test examples. We evaluate existing metrics on OpenMEVA and observe that they correlate poorly with human judgments, fail to recognize discourse-level incoherence, and lack inferential knowledge (e.g., causal order between events), generalization ability, and robustness. Our study presents insights for developing NLG models and metrics in further research.
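
As a rough illustration of capability (a), correlation with human judgments, the sketch below computes a Pearson correlation between per-story human ratings and metric scores. It is a minimal example and not the OpenMEVA toolkit's API: the function name `pearson` and the toy numbers are assumptions made purely for illustration and are not taken from the paper.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of co-deviations and sums of squared deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical toy data (NOT from the paper): one human rating and one
# automatic metric score per generated story.
human_scores = [1, 2, 3, 4, 5]
metric_scores = [0.30, 0.25, 0.40, 0.35, 0.50]

r = pearson(human_scores, metric_scores)
print(f"Pearson r between metric and human judgments: {r:.3f}")
```

A metric that tracks human ratings well would yield a value of `r` close to 1; the benchmark's finding is that existing metrics fall well short of this on the manually annotated stories.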

Paper Structure

This paper contains 28 sections, 3 figures, 17 tables.

Figures (3)

  • Figure 1: Overview of the manual annotation interface. Story A receives two points for overall quality because three points are deducted for its repetitive plot and chaotic scene. The ratings of Annotator #5 for the current story group are rejected because of the low score given to the human-written story and the high score given to the negative sample.
  • Figure 2: Correlation between human judgment difference (x-axis) and metric score difference (y-axis). Top: ROC, Bottom: WP. Only the positive x-axis is shown, since the plot is centrally symmetric about the origin. Human (S)/Metric (S) means that the difference in human judgment/metric score is significant ($p<0.01$, t-test), while (NS) means that the difference is not significant. $r^2$ is the coefficient of determination of the linear regression (red line) and equals the square of the Pearson correlation coefficient between the x-axis and y-axis values.
  • Figure 3: Boxplot of human judgments for each story source (Top: ROC, Bottom: WP).
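
The Figure 2 caption notes that $r^2$ from the linear regression equals the squared Pearson correlation. The sketch below checks that identity numerically for a simple least-squares line with an intercept. The helper names and the toy differences are assumptions for illustration only and do not come from the paper.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def r_squared(xs, ys):
    """Coefficient of determination of the least-squares line y = a*x + b."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Residual and total sums of squares.
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1.0 - ss_res / ss_tot


# Hypothetical toy data (NOT from the paper): human judgment differences
# and the corresponding metric score differences for pairs of stories.
human_diff = [0.2, 0.5, 0.9, 1.4, 2.0]
metric_diff = [0.05, 0.01, 0.12, 0.10, 0.20]

print(r_squared(human_diff, metric_diff), pearson(human_diff, metric_diff) ** 2)
```

The two printed values agree up to floating-point error, which is why the caption can report a single $r^2$ for both the regression fit and the correlation.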