LitBench: A Benchmark and Dataset for Reliable Evaluation of Creative Writing

Daniel Fein, Sebastian Russo, Violet Xiang, Kabir Jolly, Rafael Rafailov, Nick Haber

TL;DR

LitBench tackles the challenge of evaluating open-ended creative writing by providing a standardized benchmark with paired human-labeled comparisons and a large training corpus. It benchmarks zero-shot LLM judges against trained reward models (Bradley-Terry and Generative Reward Models), finding that trained verifiers outperform zero-shot judges, with Claude-3.7-Sonnet achieving 73% agreement and BT/GenRM reaching about 78%. The authors validate alignment with human preferences through online experiments on novel LLM-generated stories and release LitBench and reward models for public use. The results suggest that targeted preference-finetuning yields more reliable evaluation signals for creative writing than relying solely on large, off-the-shelf judges.

Abstract

Evaluating creative writing generated by large language models (LLMs) remains challenging because open-ended narratives lack ground truths. Without performant automated evaluation methods, off-the-shelf (OTS) language models are employed as zero-shot judges, yet their reliability is unclear in this context. In pursuit of robust evaluation for creative writing, we introduce LitBench, the first standardized benchmark and paired dataset for creative writing verification, comprising a held-out test set of 2,480 debiased, human-labeled story comparisons drawn from Reddit and a 43,827-pair training corpus of human preference labels. Using LitBench, we (i) benchmark zero-shot LLM judges, (ii) train Bradley-Terry and generative reward models, and (iii) conduct an online human study to validate reward model rankings on newly LLM-generated stories. Our benchmark identifies Claude-3.7-Sonnet as the strongest off-the-shelf judge, reaching 73% agreement with human preferences; among trained reward models, Bradley-Terry and generative reward models both attain an accuracy of 78%, outperforming all off-the-shelf judges. An online human study further confirms that our trained reward models consistently align with human preferences in novel LLM-generated stories. We release LitBench and reward models at https://huggingface.co/collections/SAA-Lab/litbench-68267b5da3aafe58f9e43461, providing a vetted resource for reliable, automated evaluation and optimization of creative writing systems.
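
To make the two quantities in the abstract concrete, here is a minimal sketch of the standard Bradley-Terry pairwise objective and of the agreement metric (the fraction of pairs where the human-preferred story receives the higher score). It is an illustration under stated assumptions, not the authors' training code: the function names and the toy scores are hypothetical, and in practice a reward model would produce the scalar scores.

```python
import math

def bt_preference_prob(score_chosen, score_rejected):
    """Bradley-Terry probability that the chosen story is preferred:
    the logistic sigmoid of the score difference."""
    return 1.0 / (1.0 + math.exp(-(score_chosen - score_rejected)))

def bt_loss(pairs):
    """Mean negative log-likelihood over (score_chosen, score_rejected) pairs."""
    return -sum(math.log(bt_preference_prob(c, r)) for c, r in pairs) / len(pairs)

def agreement(pairs):
    """Fraction of pairs where the human-preferred story gets the higher score."""
    return sum(1 for c, r in pairs if c > r) / len(pairs)

# Hypothetical scores: each tuple is (human-preferred story, other story).
toy_pairs = [(2.0, 0.5), (0.1, 0.4), (1.2, 1.0), (3.0, -1.0)]

print(f"agreement = {agreement(toy_pairs):.2f}")
print(f"BT loss   = {bt_loss(toy_pairs):.4f}")
```

Minimizing the loss pushes the preferred story's score above the other story's score, which is what the agreement metric then measures on held-out pairs. This plain form of the sigmoid can overflow for very large negative score differences; a production implementation would use a numerically stable log-sigmoid.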

Paper Structure

This paper contains 28 sections, 1 equation, 8 figures, 1 table.

Figures (8)

  • Figure 1: Preprocessing methodology for dataset creation.
  • Figure 2: Length bias mitigation.
  • Figure 3: Distributions of word count, date, and upvotes for the LitBench test and training sets.
  • Figure 4: Trained verifiers outperform zero-shot LLM judges on LitBench. Claude-3.7-Sonnet is the strongest zero-shot model. BT verifiers are competitive with GenRMs, but GenRMs with CoTs perform worse. The sizes of the Qwen, Llama, and Gemma backbones are 7B, 8B, and 12B, respectively.
  • Figure 5: Qualities of explanation text that impact verdict accuracy.
  • ...and 3 more figures