Topic-Specific Classifiers are Better Relevance Judges than Prompted LLMs

Lukas Gienapp, Martin Potthast, Harrisen Scells, Eugene Yang

Abstract

The unjudged document problem, where pooled test collections have incomplete relevance judgments for evaluating new retrieval systems, is a key obstacle to the reusability of test collections in information retrieval. While the de facto standard for dealing with the problem is to treat unjudged documents as non-relevant, many alternatives have been proposed, including the use of large language models (LLMs) as relevance judges (LLM-as-a-judge). However, this has been criticized as circular, since the same LLM can be used as a judge and as a ranker at the same time. We propose to train topic-specific relevance classifiers instead: By fine-tuning monoT5 with independent LoRA weight adaptation on the judgments of a single assessor for a single topic's pool, we align it to that assessor's notion of relevance for the topic. The system rankings obtained through our classifier's relevance judgments achieve a Spearman's $\rho$ correlation of $>0.95$ with ground truth system rankings. As few as 128 initial human judgments per topic suffice to improve the comparability of models, compared to treating unjudged documents as non-relevant, while achieving more reliability than existing LLM-as-a-judge approaches. Topic-specific relevance classifiers are thus a lightweight and straightforward way to tackle the unjudged document problem, while maintaining human judgments as the gold standard for retrieval evaluation. Code, models, and data are made openly available.

Paper Structure

This paper contains 29 sections, 3 figures, 2 tables.

Figures (3)

  • Figure 1: Effectiveness of our topic-specific ranker approach in terms of F$_1$, precision, recall, and accuracy when fine-tuning versus when using LoRA, across different model architectures.
  • Figure 2: Precision, recall, F$_1$, and accuracy for binary relevance classification by adapter models trained on different numbers of training samples. Colors indicate datasets: TREC Robust04, TREC DL19, TREC DL20. The DL datasets are in-domain (MS MARCO) for the base ranking model; Robust04 is not.
  • Figure 3: Spearman's $\rho$ rank correlation between system orderings and those derived from human judgments, at different nDCG depths and at different run-subsampling rates for simulated pooling. Missing relevance judgments are infilled either by assuming non-relevance (baseline, dashed) or with judge adapter relevance predictions for $t=256$, $t=192$, $t=128$, $t=64$, where $t$ is the adapter training set size. Shaded areas show bootstrapped uncertainty ($n=20$).
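
The comparison described for Figure 3 can be illustrated with a small sketch. It is not the authors' code: the function names, the single toy topic, the three toy systems, and the hand-written "predicted" labels (standing in for a topic-specific classifier) are all assumptions made for illustration. It scores systems with nDCG@3 using linear gains, once with unjudged documents counted as non-relevant and once with predictions filling the gaps, and compares each resulting system ordering to the ordering under complete judgments via Spearman's $\rho$.

```python
import math

def ndcg_at_k(ranking, qrels, k):
    """nDCG@k with linear gains; documents missing from qrels count as gain 0."""
    gains = [qrels.get(doc, 0) for doc in ranking[:k]]
    dcg = sum(g / math.log2(i + 2) for i, g in enumerate(gains))
    ideal = sorted(qrels.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(i + 2) for i, g in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0

def average_ranks(values):
    """1-based ascending ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1
        for pos in range(start, end + 1):
            ranks[order[pos]] = shared
        start = end + 1
    return ranks

def spearman_rho(x, y):
    """Spearman's rho, computed as the Pearson correlation of the rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy) if sx > 0 and sy > 0 else float("nan")

def infill(human_qrels, predicted):
    """Fill unjudged documents with predictions; human judgments always win."""
    merged = dict(predicted)
    merged.update(human_qrels)
    return merged

# Toy data for one topic (hypothetical, for illustration only).
full_qrels = {"d1": 1, "d2": 1, "d3": 1, "d4": 0, "d5": 0, "d6": 0}  # complete judgments
pooled_qrels = {"d1": 1, "d4": 0}                                   # incomplete pool
predicted = {"d2": 1, "d3": 0, "d5": 0, "d6": 0}                    # classifier output; d3 is a miss

systems = {
    "A": ["d1", "d4", "d5"],  # contributed to the pool
    "B": ["d2", "d3", "d1"],  # new system, retrieves unjudged relevant documents
    "C": ["d4", "d2", "d6"],
}

def score_all(qrels, k=3):
    return [ndcg_at_k(run, qrels, k) for run in systems.values()]

truth = score_all(full_qrels)
rho_baseline = spearman_rho(truth, score_all(pooled_qrels))
rho_infilled = spearman_rho(truth, score_all(infill(pooled_qrels, predicted)))

print(round(rho_baseline, 3), round(rho_infilled, 3))
```

In this toy setup the baseline ranks the pooled system A above the new system B, which gives $\rho = 0.5$ against the ordering under complete judgments. Infilling restores the correct ordering ($\rho = 1.0$), even though the hypothetical classifier misses one relevant document. Real evaluations average over many topics and systems, so these numbers only illustrate the mechanism.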