Curiosity-Driven LLM-as-a-judge for Personalized Creative Judgment
Vanya Bannihatti Kumar, Divyanshu Goyal, Akhil Eppa, Neel Bhandari
TL;DR
This work tackles the challenge of subjectively evaluating creativity by learning personalized judgments from multiple expert annotators. It introduces an Intrinsic Curiosity Model (ICM) that couples a forward belief-shift score with an expert-attribution signal, producing a curiosity signal used to condition a supervised fine-tuning model. Across model scales and a 5-fold cross-validation on the TTCW dataset, ICM consistently improves Pearson correlation, F1, and Cohen's kappa over baseline SFT methods and even outperforms GPT-5 in several settings, especially in out-of-distribution scenarios. The approach enables user-aligned, scalable evaluation of creative writing with potential extensions to other subjective domains and RL-based reward schemes.
Abstract
Modern large language models (LLMs) excel at objective tasks such as evaluating mathematical reasoning and factual accuracy, yet they falter when faced with the nuanced, subjective nature of assessing creativity. In this work, we propose a novel curiosity-driven LLM-as-a-judge for evaluating creative writing that is personalized to each individual's creative judgments. To test our hypothesis, we use the Torrance Test of Creative Writing (TTCW) benchmark introduced in Chakrabarty et al. (2024), which contains stories annotated by human experts along subjective dimensions such as Originality. We show that our method enables models of various sizes to learn the nuanced creative judgments of different individuals, improving over a baseline supervised fine-tuning (SFT) method on evaluation metrics such as Pearson correlation, Cohen's kappa, and F1. Our method is especially useful in subjective evaluations where annotators do not all agree with each other.
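To make the three agreement metrics named above concrete, the following is a minimal sketch of how a judge's predictions could be scored against a single annotator's labels. It assumes binary pass/fail labels encoded as 0 and 1; the function names and the toy data are hypothetical and are not taken from the paper's code or results.

```python
import math
from typing import Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation between two equal-length score lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return float("nan")  # undefined when one side is constant
    return cov / math.sqrt(var_x * var_y)


def cohen_kappa(x: Sequence[int], y: Sequence[int]) -> float:
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(x)
    observed = sum(a == b for a, b in zip(x, y)) / n
    labels = set(x) | set(y)
    expected = sum((list(x).count(l) / n) * (list(y).count(l) / n) for l in labels)
    if expected == 1:
        return float("nan")  # both raters used a single identical label
    return (observed - expected) / (1 - expected)


def f1_binary(y_true: Sequence[int], y_pred: Sequence[int], positive: int = 1) -> float:
    """F1 score for the positive class."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Hypothetical toy data: one annotator's labels and a judge's predictions.
annotator_labels = [1, 0, 1, 1, 0, 0]
judge_predictions = [1, 0, 0, 1, 0, 1]

print("Pearson:", pearson(annotator_labels, judge_predictions))
print("Cohen's kappa:", cohen_kappa(annotator_labels, judge_predictions))
print("F1:", f1_binary(annotator_labels, judge_predictions))
```

In a personalized setting, one plausible use is to compute these scores separately for each annotator and then average them, so that disagreement between annotators is not collapsed into a single majority label.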
