Curiosity-Driven LLM-as-a-judge for Personalized Creative Judgment

Vanya Bannihatti Kumar, Divyanshu Goyal, Akhil Eppa, Neel Bhandari

TL;DR

This work tackles the challenge of evaluating creativity, an inherently subjective task, by learning personalized judgments from multiple expert annotators. It introduces an Intrinsic Curiosity Model (ICM) that couples a forward belief-shift score with an expert-attribution signal, producing a curiosity signal used to condition a supervised fine-tuning (SFT) model. Across model scales and under 5-fold cross-validation on the TTCW dataset, ICM consistently improves Pearson correlation, F1, and Cohen's kappa over baseline SFT methods and even outperforms GPT-5 in several settings, especially in out-of-distribution scenarios. The approach enables user-aligned, scalable evaluation of creative writing, with potential extensions to other subjective domains and RL-based reward schemes.

Abstract

Modern large language models (LLMs) excel at objective tasks such as evaluating mathematical reasoning and factual accuracy, yet they falter when faced with the nuanced, subjective nature of assessing creativity. In this work, we propose a novel curiosity-driven LLM-as-a-judge for evaluating creative writing that is personalized to each individual's creative judgments. To test our hypothesis, we use the Torrance Test of Creative Writing (TTCW) benchmark introduced in Chakrabarty et al. (2024), which contains stories annotated by human experts along subjective dimensions such as Originality. We show that our method enables models of various sizes to learn the nuanced creative judgments of different individuals, improving over a baseline supervised fine-tuning (SFT) method on evaluation metrics such as Pearson correlation, Cohen's kappa, and F1. Our method is especially useful in subjective evaluations where annotators do not all agree with each other.
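The three agreement metrics named in the abstract can be illustrated with a minimal sketch. It assumes binary labels (1 = pass, 0 = fail), and the `expert` and `judge` lists are hypothetical toy data, not results from the paper. The function names are illustrative and use only the Python standard library.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def f1_binary(y_true, y_pred, positive=1):
    """F1 score for the positive class: harmonic mean of precision and recall."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def cohen_kappa(a, b):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(a)
    labels = set(a) | set(b)
    p_observed = sum(x == y for x, y in zip(a, b)) / n
    p_chance = sum((a.count(l) / n) * (b.count(l) / n) for l in labels)
    if p_chance == 1.0:
        return 1.0  # both raters use a single identical label
    return (p_observed - p_chance) / (1 - p_chance)

# Hypothetical toy labels: one expert's judgments vs. a judge model's predictions.
expert = [1, 0, 1, 1, 0, 0, 1, 0]
judge = [1, 0, 0, 1, 0, 1, 1, 0]

print(round(pearson(expert, judge), 3))      # 0.5
print(round(f1_binary(expert, judge), 3))    # 0.75
print(round(cohen_kappa(expert, judge), 3))  # 0.5
```

In this toy case the judge agrees with the expert on 6 of 8 items, but kappa is only 0.5 because half of that agreement would be expected by chance. This is why chance-corrected metrics are reported alongside F1 when annotators disagree.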

Paper Structure

This paper contains 36 sections, 13 equations, 5 figures, 15 tables.

Figures (5)

  • Figure 1: Overview of the architecture during training for the Curiosity-Driven LLM-as-a-judge
  • Figure 2: Overview of the architecture during inference for the Curiosity-Driven LLM-as-a-judge
  • Figure 3: Comparison of baselines with and without using explanations.
  • Figure 4: Three-way comparison across model sizes for ICM (ours), SFT baseline (classification, no explanations), and SFT baseline (with explanations). Panels show Pearson and F1 for in-distribution (top) and out-of-distribution (bottom). For exact results of the ID and OOD experiments for the baseline without explanations (classification), refer to the tables labeled tab:baseline_classification_expt_id and tab:baseline_classification_expt_ood.
  • Figure 5: Curiosity scores grouped by whether predictions from the base (non-finetuned) Qwen-0.5B model match or mismatch the ground truth