Support Evaluation for the TREC 2024 RAG Track: Comparing Human versus LLM Judges

Nandan Thakur, Ronak Pradeep, Shivani Upadhyay, Daniel Campos, Nick Craswell, Jimmy Lin

TL;DR

The paper investigates whether a strong LLM judge can substitute for human judges in support evaluation for retrieval-augmented generation (RAG) by comparing GPT-4o and human judgments on the TREC 2024 RAG Track. It analyzes two evaluation setups (manual from scratch and manual with post-editing) across 45 participant submissions on 36 topics, using weighted precision and weighted recall as core metrics. The results show substantial agreement between GPT-4o and humans, with perfect agreement rising from 56% in the from-scratch condition to 72.1% when post-editing is allowed, and high run-level correlations. An independent judge and an alternate LLM (LLAMA-3.1) further corroborate GPT-4o’s alignment, which supports the viability of LLM judges for scalable support assessment and points to areas where annotation reliability can improve. Qualitative analyses of errors provide guidance for refining future iterations of support evaluation.

Abstract

Retrieval-augmented generation (RAG) enables large language models (LLMs) to generate answers with citations from source documents containing "ground truth", thereby reducing system hallucinations. A crucial factor in RAG evaluation is "support": whether the information in the cited documents supports the answer. To this end, we conducted a large-scale comparative study of 45 participant submissions on 36 topics to the TREC 2024 RAG Track, comparing an automatic LLM judge (GPT-4o) against human judges for support assessment. We considered two conditions: (1) fully manual assessments from scratch and (2) manual assessments with post-editing of LLM predictions. Our results indicate that for 56% of the manual from-scratch assessments, human and GPT-4o predictions match perfectly (on a three-level scale), increasing to 72% in the manual with post-editing condition. Furthermore, by carefully analyzing the disagreements in an unbiased study, we found that an independent human judge correlates better with GPT-4o than with the original human judge, suggesting that LLM judges can be a reliable alternative for support assessment. To conclude, we provide a qualitative analysis of human and GPT-4o errors to help guide future iterations of support assessment.

Paper Structure

This paper contains 20 sections, 2 equations, 5 figures, 8 tables.

Figures (5)

  • Figure 1: Prompt used by the GPT-4o judge for support evaluation.
  • Figure 2: Correlations between scores from human and GPT-4o judges for the manual from-scratch condition (top) and the manual with post-editing condition (bottom), measuring weighted precision and recall. Red markers show run-level scores, yellow triangles show per-topic averages, and blue dots or green boxes show all individual topic/run combinations. Each plot is annotated with rank correlations showing Kendall's $\tau$.
  • Figure 3: Confusion matrices comparing predictions from human and GPT-4o judges for the manual from-scratch condition (left) and the manual with post-editing condition (right).
  • Figure 4: Inter-annotator agreement score (Cohen's $\kappa$) for our unbiased study on disagreements between GPT-4o and human annotators.
  • Figure 5: Support label prediction by different judges for each support category (FS, PS, NS) in the disagreement analysis on 537 sentence--passage pairs.
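
The agreement statistics named above (the share of perfectly matching labels and Cohen's $\kappa$) can be illustrated with a minimal sketch. It uses the textbook definition of Cohen's $\kappa$ and is not the authors' evaluation code. The function names and the toy label lists are hypothetical, and FS, PS, and NS are read here as presumably standing for full, partial, and no support on the three-level scale.

```python
from collections import Counter

# Three-level support scale; the expansions are an assumption of this sketch.
LABELS = ("FS", "PS", "NS")  # full support, partial support, no support


def perfect_agreement(judge_a, judge_b):
    """Fraction of items on which both judges assign the same label."""
    if len(judge_a) != len(judge_b) or not judge_a:
        raise ValueError("label lists must be non-empty and of equal length")
    return sum(a == b for a, b in zip(judge_a, judge_b)) / len(judge_a)


def cohens_kappa(judge_a, judge_b):
    """Textbook Cohen's kappa: (p_o - p_e) / (1 - p_e)."""
    n = len(judge_a)
    p_o = perfect_agreement(judge_a, judge_b)  # observed agreement
    counts_a, counts_b = Counter(judge_a), Counter(judge_b)
    # Agreement expected by chance from each judge's label frequencies.
    p_e = sum(counts_a[label] * counts_b[label] for label in LABELS) / (n * n)
    if p_e == 1.0:
        return 1.0  # both judges always use the same single label
    return (p_o - p_e) / (1.0 - p_e)


# Hypothetical toy labels, not data from the paper.
human = ["FS", "FS", "PS", "NS", "NS", "FS", "PS", "NS"]
llm = ["FS", "PS", "PS", "NS", "FS", "FS", "NS", "NS"]

print(f"perfect agreement: {perfect_agreement(human, llm):.3f}")
print(f"Cohen's kappa:     {cohens_kappa(human, llm):.3f}")
```

On these toy labels the two judges match on 5 of 8 items, and $\kappa$ discounts that raw rate by the agreement expected from the label frequencies alone, which is why it is lower than the raw match rate.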