Through the Judge's Eyes: Inferred Thinking Traces Improve Reliability of LLM Raters

Xingjian Zhang, Tianhong Gao, Suliang Jin, Tianhao Wang, Teng Ye, Eytan Adar, Qiaozhu Mei

TL;DR

Large language models are increasingly used as evaluators for subjective text generation, but reliability remains limited without access to human reasoning. The authors propose a human-LLM collaborative framework to infer thinking traces from label-only annotations via rejection sampling, producing a reasoning-rich dataset (D_reason) that augments training and prompting. They demonstrate two complementary applications: fine-tuning open LLM raters with thinking-trace data and automatically refining annotation codebooks for proprietary models, improving alignment with human judgments and cross-model consistency across diverse tasks. The approach enables scaling label-only corpora into thinking-trace-augmented resources, offering a practical pathway to more reliable and interpretable LLM-based evaluation in real-world settings.

Abstract

Large language models (LLMs) are increasingly used as raters for evaluation tasks. However, their reliability is often limited for subjective tasks, where human judgments involve subtle reasoning beyond annotation labels. Thinking traces, the reasoning behind a judgment, are highly informative but challenging to collect and curate. We present a human-LLM collaborative framework to infer thinking traces from label-only annotations. The proposed framework uses a simple and effective rejection sampling method to reconstruct these traces at scale. These inferred thinking traces are applied to two complementary tasks: (1) fine-tuning open LLM raters; and (2) synthesizing clearer annotation guidelines for proprietary LLM raters. Across multiple datasets, our methods lead to significantly improved LLM-human agreement. Additionally, the refined annotation guidelines increase agreement among different LLMs. These results suggest that LLMs can serve as practical proxies for otherwise unrevealed human thinking traces, enabling label-only corpora to be extended into thinking-trace-augmented resources that enhance the reliability of LLM raters.
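
The rejection sampling idea can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the acceptance rule, so the sketch assumes a sampled trace is kept only when the label it arrives at equals the human label. The names `Rater`, `infer_trace`, `build_d_reason`, and `toy_rater` are hypothetical, and `toy_rater` is a random stand-in for a real reasoning model.

```python
import random
from typing import Callable, List, Optional, Tuple

# Hypothetical rater interface (an assumption, not from the paper):
# given an item and a random source, return (thinking_trace, predicted_label).
Rater = Callable[[str, random.Random], Tuple[str, int]]


def infer_trace(item: str, human_label: int, rater: Rater,
                max_tries: int = 8, seed: int = 0) -> Optional[str]:
    """Sample traces until one ends at the human label; reject the rest."""
    rng = random.Random(seed)
    for _ in range(max_tries):
        trace, label = rater(item, rng)
        if label == human_label:  # assumed acceptance rule
            return trace
    return None  # every sample was rejected within the budget


def build_d_reason(data: List[Tuple[str, int]], rater: Rater,
                   max_tries: int = 8) -> List[Tuple[str, str, int]]:
    """Turn label-only pairs (item, label) into (item, trace, label) triples."""
    d_reason = []
    for i, (item, label) in enumerate(data):
        trace = infer_trace(item, label, rater, max_tries, seed=i)
        if trace is not None:
            d_reason.append((item, trace, label))
    return d_reason


def toy_rater(item: str, rng: random.Random) -> Tuple[str, int]:
    """Stand-in for an LLM call: picks a random 1-5 score with a dummy trace."""
    label = rng.randint(1, 5)
    return f"Toy reasoning about {item!r} leading to score {label}.", label


if __name__ == "__main__":
    labeled = [("story A", 4), ("story B", 2)]
    for item, trace, label in build_d_reason(labeled, toy_rater, max_tries=50):
        print(label, "|", trace)
```

Items for which no sample matches within `max_tries` are dropped in this sketch; a real pipeline might instead flag them for human review, which is a design choice the source does not settle.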

Paper Structure

This paper contains 56 sections, 2 equations, 5 figures, 5 tables.

Figures (5)

  • Figure 1: Original Annotation Codebook (Evaluating Complexity of Short Stories) \cite{chhun2022human}. Only very basic instructions and scoring rubrics are provided.
  • Figure 2: An example of the inferred thinking trace in evaluating the engagement level of a short story.
  • Figure 3: Illustration of Inferring Thinking Traces through an RLM. Details are provided in Section \ref{sec:infer_cot}.
  • Figure 4: Refined Annotation Codebook. Part of the content is omitted due to space limitations. See Appendix \ref{sec:codebook_examples} for a complete example.
  • Figure 5: Example thinking traces from two LLM raters on evaluating the complexity of a short story, before and after codebook refinement.