The Effect of Document Summarization on LLM-Based Relevance Judgments

Samaneh Mohtadi, Kevin Roitero, Stefano Mizzaro, Gianluca Demartini

TL;DR

This paper examines how document summarization affects LLM-based relevance judgments in IR evaluation. It compares full-document inputs with LLM-generated summaries (Summ-80, Summ-120) across three datasets (DL-19, DL-20, RAG-24), using GPT-4o and Llama-3.1-8B-Instruct as assessors. The analysis covers agreement with human labels, retrieval effectiveness, ranking stability, and the cost implications of semantic compression. GPT-4o maintains strong alignment and stable rankings with summaries, especially at 80 tokens, while Llama-3.1-8B-Instruct is more sensitive to compression; token-cost savings are substantial in large collections such as RAG-24. Overall, summarization emerges as a scalable, cost-efficient alternative for automatic judgments, though model- and dataset-specific biases must be considered when assessing reliability.

Abstract

Relevance judgments are central to the evaluation of Information Retrieval (IR) systems, but obtaining them from human annotators is costly and time-consuming. Large Language Models (LLMs) have recently been proposed as automated assessors, showing promising alignment with human annotations. Most prior studies have treated documents as fixed units, feeding their full content directly to LLM assessors. We investigate how text summarization affects the reliability of LLM-based judgments and their downstream impact on IR evaluation. Using state-of-the-art LLMs across multiple TREC collections, we compare judgments made from full documents with those based on LLM-generated summaries of different lengths. We examine their agreement with human labels, their effect on retrieval effectiveness evaluation, and their influence on IR systems' ranking stability. Our findings show that summary-based judgments achieve comparable stability in systems' ranking to full-document judgments, while introducing systematic shifts in label distributions and biases that vary by model and dataset. These results highlight summarization as both an opportunity for more efficient large-scale IR evaluation and a methodological choice with important implications for the reliability of automatic judgments.

Paper Structure

This paper contains 16 sections, 4 figures, and 5 tables.

Figures (4)

  • Figure 1: Prompt for generating summaries.
  • Figure 2: Scatter plots of Human- vs. LLM-derived retrieval effectiveness (nDCG@10) for DL-19 and DL-20.
  • Figure 3: Scatter plots of Human- vs. LLM-derived retrieval effectiveness (MAP).
  • Figure 4: Kendall’s $\tau$ (computed on nDCG@10 system scores) with 95% bootstrap confidence intervals across judging modalities.
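The ranking-stability measure named in Figure 4 can be illustrated with a minimal sketch. It computes Kendall's $\tau$ between two lists of per-system scores (for example, nDCG@10 under human labels and under LLM labels) and a percentile bootstrap confidence interval. Several things here are assumptions rather than details from the paper: the tau-a variant (ties count as neither concordant nor discordant), resampling over systems, the function names `kendall_tau` and `bootstrap_ci`, and the score values, which are made up for illustration.

```python
import random
from itertools import combinations


def kendall_tau(x, y):
    """Kendall's tau-a: (concordant - discordant) / number of pairs.

    Tied pairs count as neither concordant nor discordant (an assumption;
    the paper may use a tie-corrected variant).
    """
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length lists with at least 2 items")
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        sign = (x[i] - x[j]) * (y[i] - y[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / n_pairs


def bootstrap_ci(x, y, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for tau, resampling systems with replacement."""
    rng = random.Random(seed)
    n = len(x)
    taus = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        taus.append(kendall_tau([x[i] for i in idx], [y[i] for i in idx]))
    taus.sort()
    lower = taus[int((alpha / 2) * n_resamples)]
    upper = taus[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return lower, upper


if __name__ == "__main__":
    # Hypothetical per-system nDCG@10 scores, not taken from the paper.
    human_scores = [0.71, 0.65, 0.58, 0.52, 0.47, 0.40]
    llm_scores = [0.69, 0.66, 0.55, 0.57, 0.45, 0.41]
    print("tau:", kendall_tau(human_scores, llm_scores))
    print("95% CI:", bootstrap_ci(human_scores, llm_scores))
```

A value of $\tau$ near 1 means the two judging modalities order the systems almost identically. A wide interval signals that the estimate rests on few systems.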