Evaluating Text Creativity across Diverse Domains: A Dataset and Large Language Model Evaluator

Qian Cao; Xiting Wang; Yuzhuo Yuan; Yahui Liu; Fang Luo; Ruihua Song

TL;DR

This work tackles the problem of evaluating textual creativity across diverse domains, a task hindered by domain-specific benchmarks and costly human judgments. It introduces CreataSet, a large-scale, cross-domain dataset, and CrEval, an LLM-based evaluator trained with a weakly supervised, mixed-label framework that leverages both human and synthetic data. CrEval consistently outperforms baselines and demonstrates cross-domain generalization, while enabling improvements in model creativity through data-driven generation and evaluation loops. The approach offers a scalable, human-aligned path to advancing creativity evaluation and generation in large language models, with data, code, and models slated for public release.

Abstract

Creativity evaluation remains a challenging frontier for large language models (LLMs). Current evaluations rely heavily on slow and costly human judgments, which hinders progress in enhancing machine creativity. Automated methods exist, ranging from psychological testing to heuristic- or prompting-based approaches, but they often lack generalizability or alignment with human judgment. To address these issues, we propose a novel pairwise-comparison framework for assessing textual creativity that uses shared contextual instructions to improve evaluation consistency. We introduce CreataSet, a large-scale dataset with 100K+ human-level and 1M+ synthetic creative instruction-response pairs spanning diverse open-domain tasks. By training on CreataSet, we develop an LLM-based evaluator named CrEval. CrEval aligns substantially better with human judgments than existing methods. Experimental results show that integrating both human-generated and synthetic data is essential for training robust evaluators, and demonstrate the practical utility of CrEval in boosting the creativity of LLMs. We will release all data, code, and models publicly soon to support further research.
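To make the pairwise-comparison setup concrete, the following is a minimal sketch, not the authors' implementation. It assumes that candidate responses to one shared instruction carry some creativity score, that pairs are formed only within the same instruction, and that an evaluator is scored by how often its preference matches the labeled one. The names `Pair`, `build_pairs`, `pairwise_agreement`, and the toy `longer_is_more_creative` judge are all hypothetical and do not come from the paper.

```python
from dataclasses import dataclass
from itertools import combinations
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class Pair:
    """Two responses to the SAME instruction plus a preference label (hypothetical schema)."""
    instruction: str
    response_a: str
    response_b: str
    label: int  # 0 if response_a is labeled more creative, 1 if response_b is


def build_pairs(instruction: str, scored: List[Tuple[str, float]]) -> List[Pair]:
    """Form every within-instruction pair; ties are skipped because they carry no preference."""
    pairs = []
    for (text_a, score_a), (text_b, score_b) in combinations(scored, 2):
        if score_a == score_b:
            continue
        pairs.append(Pair(instruction, text_a, text_b, 0 if score_a > score_b else 1))
    return pairs


def pairwise_agreement(pairs: List[Pair], judge: Callable[[str, str, str], int]) -> float:
    """Fraction of pairs on which the judge's choice (0 or 1) matches the label."""
    if not pairs:
        raise ValueError("no pairs to evaluate")
    hits = sum(1 for p in pairs if judge(p.instruction, p.response_a, p.response_b) == p.label)
    return hits / len(pairs)


def longer_is_more_creative(instruction: str, a: str, b: str) -> int:
    """Toy stand-in for an evaluator: prefers the longer response (assumption, for illustration only)."""
    return 0 if len(a) >= len(b) else 1


if __name__ == "__main__":
    instruction = "Write a metaphor for time."
    scored = [
        ("Time is a river.", 2.0),
        ("Time is a thief that leaves its fingerprints on every mirror.", 5.0),
        ("Time passes.", 1.0),
    ]
    pairs = build_pairs(instruction, scored)
    print(len(pairs), pairwise_agreement(pairs, longer_is_more_creative))
```

In practice the toy judge would be replaced by a call to a trained evaluator. Keeping both responses under one instruction is what lets the comparison isolate creativity from differences in the task itself.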
