CoKe: Customizable Fine-Grained Story Evaluation via Chain-of-Keyword Rationalization

Brihi Joshi; Sriram Venkatapathy; Mohit Bansal; Nanyun Peng; Haw-Shiuan Chang

TL;DR

CoKe introduces a chain-of-keywords approach to fine-grained story evaluation, where a rationalizer first generates keyword sequences before free-text rationales to guide a scorer in predicting rating scores. By sampling multiple keyword sequences ($\mathcal{N}$) and averaging the resulting scores, CoKe models annotator diversity and better approximates population averages. On the StoryER dataset, CoKe with modest-sized, fine-tuned models achieves human-level performance and significantly outperforms GPT-4 in correlation, while using orders of magnitude fewer parameters. The method enhances interpretability through keyword-based explanations and offers a flexible, customizable evaluation framework suitable for audience-specific ratings and downstream applications.

Abstract

Evaluating creative text such as human-written stories using language models has always been a challenging task, owing to the subjectivity of multi-annotator ratings. To mimic the thinking process of humans, chain of thought (CoT) generates free-text explanations that help guide a model's predictions, and Self-Consistency (SC) marginalizes predictions over multiple generated explanations. In this study, we discover that the widely used self-consistency reasoning methods cause suboptimal results due to an objective mismatch between generating 'fluent-looking' explanations and actually leading to a good rating prediction for an aspect of a story. To overcome this challenge, we propose $\textbf{C}$hain-$\textbf{o}$f-$\textbf{Ke}$ywords (CoKe), which generates a sequence of keywords $\textit{before}$ generating a free-text rationale; these keywords guide the rating prediction of our evaluation language model. Then, we generate a diverse set of such keywords and aggregate the scores corresponding to these generations. On the StoryER dataset, CoKe based on our small fine-tuned evaluation models not only reaches human-level performance and significantly outperforms GPT-4 with a 2x boost in correlation with human annotators, but also requires drastically fewer parameters.
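
The aggregation step described above (sample several keyword sequences, score the story once per sequence, then average) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `sample_keywords` and `score_with_keywords` are hypothetical callables standing in for the fine-tuned rationalizer and scorer, and `n_samples` plays the role of the number of sampled keyword sequences.

```python
from statistics import mean
from typing import Callable, List

# Hypothetical interfaces (assumptions, not from the paper's code):
#   sample_keywords(story, aspect) -> list of keyword strings (one stochastic sample)
#   score_with_keywords(story, aspect, keywords) -> numeric rating
KeywordSampler = Callable[[str, str], List[str]]
KeywordScorer = Callable[[str, str, List[str]], float]


def coke_score(
    story: str,
    aspect: str,
    sample_keywords: KeywordSampler,
    score_with_keywords: KeywordScorer,
    n_samples: int = 5,
) -> float:
    """Average the ratings obtained from n_samples sampled keyword sequences."""
    if n_samples < 1:
        raise ValueError("n_samples must be at least 1")
    scores = []
    for _ in range(n_samples):
        # Each call is assumed to draw a different keyword sequence.
        keywords = sample_keywords(story, aspect)
        # The scorer conditions on the keywords to predict a rating.
        scores.append(score_with_keywords(story, aspect, keywords))
    # Averaging over samples approximates the mean rating of many annotators.
    return mean(scores)
```

In practice both callables would wrap language-model calls with sampling enabled; the averaging is the only part that the text above specifies.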
