CoKe: Customizable Fine-Grained Story Evaluation via Chain-of-Keyword Rationalization

Brihi Joshi, Sriram Venkatapathy, Mohit Bansal, Nanyun Peng, Haw-Shiuan Chang

TL;DR

CoKe introduces a chain-of-keywords approach to fine-grained story evaluation, where a rationalizer first generates keyword sequences before free-text rationales to guide a scorer in predicting rating scores. By sampling multiple keyword sequences ($\mathcal{N}$) and averaging the resulting scores, CoKe models annotator diversity and better approximates population averages. On the StoryER dataset, CoKe with modest-sized, fine-tuned models achieves human-level performance and significantly outperforms GPT-4 in correlation, while using orders of magnitude fewer parameters. The method enhances interpretability through keyword-based explanations and offers a flexible, customizable evaluation framework suitable for audience-specific ratings and downstream applications.

Abstract

Evaluating creative text such as human-written stories using language models has always been a challenging task, owing to the subjectivity of multi-annotator ratings. To mimic the thinking process of humans, chain of thought (CoT) generates free-text explanations that help guide a model's predictions, and Self-Consistency (SC) marginalizes predictions over multiple generated explanations. In this study, we discover that the widely used self-consistency reasoning methods produce suboptimal results because of an objective mismatch between generating 'fluent-looking' explanations and actually arriving at a good rating prediction for an aspect of a story. To overcome this challenge, we propose $\textbf{C}$hain-$\textbf{o}$f-$\textbf{Ke}$ywords (CoKe), which generates a sequence of keywords $\textit{before}$ generating a free-text rationale; these keywords guide the rating prediction of our evaluation language model. We then generate a diverse set of such keyword sequences and aggregate the scores corresponding to these generations. On the StoryER dataset, CoKe based on our small fine-tuned evaluation models not only reaches human-level performance and significantly outperforms GPT-4 with a 2x boost in correlation with human annotators, but also requires drastically fewer parameters.
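
The sample-then-aggregate step described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: `aggregate_rating`, `sample_keywords`, and `score` are hypothetical names, the two callables stand in for the fine-tuned rationalizer and scorer, and a plain mean is assumed as the aggregation.

```python
from statistics import mean
from typing import Callable, List


def aggregate_rating(
    story: str,
    aspect: str,
    sample_keywords: Callable[[str, str], List[str]],
    score: Callable[[str, str, List[str]], float],
    n_samples: int = 10,
) -> float:
    """Average the scorer's ratings over n_samples sampled keyword sequences.

    sample_keywords: hypothetical stand-in for the rationalizer; each call
        should return one (ideally different) keyword sequence.
    score: hypothetical stand-in for the scorer; maps a keyword sequence
        to a rating for the given story and aspect.
    """
    if n_samples < 1:
        raise ValueError("n_samples must be at least 1")
    ratings = []
    for _ in range(n_samples):
        keywords = sample_keywords(story, aspect)
        ratings.append(score(story, aspect, keywords))
    # Assumption: a simple mean approximates the population-average rating.
    return mean(ratings)
```

In this sketch, `n_samples` plays the role of $\mathcal{N}$: a larger value draws more keyword sequences, so the mean reflects a wider range of plausible annotator views at the cost of more rationalizer and scorer calls.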

Paper Structure

This paper contains 31 sections, 1 equation, 10 figures, 7 tables.

Figures (10)

  • Figure 1: CoKe provides a low-cost, audience-oriented (customizable), and keyword-guided approach to evaluating stories by generating and scoring diverse keyword sequences that explain a fine-grained aspect-story pair.
  • Figure 2: ICC annotator agreement scores for the stories with a certain aspect in the training set.
  • Figure 3: During training, CoKe extracts keywords from annotator explanations and trains rationalizers and scorers. During inference, CoKe first samples candidate keyword sequences (for the scorer) and explanations (for better interpretability), and then scores the individual generated candidates before aggregating them. Our purpose is to obtain a better population average that can capture diverse annotator scores.
  • Figure 4: Pearson's $\rho$ increases with the number of candidate generations ($\mathcal{N}$) in CoKe and its ablations. The rationalizer model here is T5-3b. We note that increasing the diversity of generation helps with better estimation of population preferences.
  • Figure 5: Suppose we want to understand the predicted rating of the heartwarming/touch aspect for a story; we can visualize the keywords across all of the generated samples. The x-axis plots the average rating of the keyword for this story, and the y-axis plots the global rating of the keyword averaged across the training set. The size of each keyword is proportional to its frequency in the generated keyword sequences.
  • ...and 5 more figures
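
The per-keyword quantities plotted in Figure 5 could be computed along these lines. This is a sketch under stated assumptions: the function name and the input layout (a list of keyword-sequence and rating pairs) are hypothetical, and a keyword's "average rating" is assumed to be the mean rating of the sequences that contain it, which the caption does not spell out.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, Iterable, List, Tuple


def keyword_average_ratings(
    samples: Iterable[Tuple[List[str], float]],
) -> Dict[str, Tuple[float, int]]:
    """Map each keyword to (mean rating, number of sequences containing it).

    samples: (keyword_sequence, rating) pairs. Passing the generated samples
    for one story gives the assumed x-axis value and marker size; passing
    training-set pairs gives the assumed global (y-axis) value.
    """
    ratings = defaultdict(list)
    for keywords, rating in samples:
        # Count a keyword once per sequence, even if it repeats within it.
        for keyword in set(keywords):
            ratings[keyword].append(rating)
    return {kw: (mean(vals), len(vals)) for kw, vals in ratings.items()}
```

Calling the function once on one story's generated samples and once on the training set, then joining the two results by keyword, yields one point per keyword for a scatter plot of this kind.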