Are Large Language Models Good at Utility Judgments?

Hengran Zhang, Ruqing Zhang, Jiafeng Guo, Maarten de Rijke, Yixing Fan, Xueqi Cheng

TL;DR

This work investigates whether large language models can perform utility judgments on retrieved passages for open-domain QA, going beyond traditional relevance judgments. It introduces GTI and GTU benchmark settings, builds a diverse candidate-passage benchmark with ground-truth, counterfactual, and noisy passages, and evaluates five LLMs under various prompt forms. Key findings show that well-instructed models can distinguish utility from relevance, that utility-guided evidence often improves QA performance, and that a k-sampling listwise approach mitigates order sensitivity and boosts answer quality. The authors release code and benchmarks to advance understanding and design of retrieval-augmented systems that prioritize utility in evidence selection.

Abstract

Retrieval-augmented generation (RAG) is considered a promising approach to alleviating the hallucination issue of large language models (LLMs), and it has recently received widespread attention from researchers. Because retrieval models are limited in semantic understanding, the success of RAG relies heavily on the ability of LLMs to identify passages with utility. Recent efforts have explored the ability of LLMs to assess the relevance of passages in retrieval, but there has been limited work on evaluating the utility of passages in supporting question answering. In this work, we conduct a comprehensive study of the capabilities of LLMs in utility evaluation for open-domain QA. Specifically, we introduce a benchmarking procedure and a collection of candidate passages with different characteristics, facilitating a series of experiments with five representative LLMs. Our experiments reveal that (i) well-instructed LLMs can distinguish between relevance and utility, and LLMs are highly receptive to newly generated counterfactual passages. Moreover, (ii) we scrutinize key factors in the instruction design that affect utility judgments. (iii) To verify the efficacy of utility judgments in practical retrieval augmentation applications, we examine LLMs' QA capabilities using the evidence judged to have utility and using direct dense retrieval results. Finally, (iv) we propose a k-sampling, listwise approach to reduce the dependency of LLMs on the sequence of input passages, thereby facilitating subsequent answer generation. We believe that the way we formalize and study the problem, along with our findings, contributes to a critical assessment of retrieval-augmented LLMs. Our code and benchmark can be found at https://github.com/ict-bigdatalab/utility_judgments.
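
The k-sampling, listwise idea mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not spell out the procedure, so the details here are assumptions: the candidate list is shuffled k times, a listwise-set judge is queried on each ordering, and passages are kept if a strict majority of runs select them. The names `k_sampling_listwise_set`, `Judge`, and `toy_judge` are hypothetical, and `toy_judge` is a stand-in for an LLM call with a deliberate bias toward the first position.

```python
import random
from collections import Counter
from typing import Callable, List, Optional, Sequence, Set

# Hypothetical judge interface: given a question and an ordered passage list,
# return the positions (in that ordering) of passages judged to have utility.
Judge = Callable[[str, Sequence[str]], Set[int]]


def k_sampling_listwise_set(
    question: str,
    passages: Sequence[str],
    judge: Judge,
    k: int = 5,
    min_votes: Optional[int] = None,
    seed: int = 0,
) -> List[int]:
    """Aggregate k listwise-set judgments over shuffled passage orders.

    Assumption: aggregation is a vote; a passage is kept when at least
    `min_votes` of the k runs select it (default: strict majority).
    """
    rng = random.Random(seed)
    if min_votes is None:
        min_votes = k // 2 + 1
    votes: Counter = Counter()
    for _ in range(k):
        order = list(range(len(passages)))
        rng.shuffle(order)  # a fresh input order for this run
        shuffled = [passages[i] for i in order]
        for pos in judge(question, shuffled):
            if 0 <= pos < len(order):  # ignore malformed positions
                votes[order[pos]] += 1  # map back to the original index
    return sorted(i for i, v in votes.items() if v >= min_votes)


def toy_judge(question: str, ordered: Sequence[str]) -> Set[int]:
    """Stand-in for an LLM: always picks position 0 (order bias),
    plus any passage containing the string '1889'."""
    picked = {0} if ordered else set()
    picked |= {i for i, p in enumerate(ordered) if "1889" in p}
    return picked


if __name__ == "__main__":
    candidates = [
        "The Eiffel Tower was completed in 1889.",
        "Paris is the capital of France.",
        "The tower is repainted roughly every seven years.",
        "Gustave Eiffel also worked on the Statue of Liberty.",
    ]
    kept = k_sampling_listwise_set(
        "When was the Eiffel Tower completed?", candidates, toy_judge, k=7
    )
    print(kept)
```

Because the position-biased pick lands on a different passage in most shuffles, its votes are spread thin, while the passage selected for its content collects a vote in every run. A single fixed ordering would not allow this separation.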

Paper Structure (15 sections, 6 figures, 4 tables)

Figures (6)

  • Figure 1: An example between utility and relevance.
  • Figure 2: Prompts in blue blocks are utility prompts, where (a) is pointwise, (b) is pairwise, (c) is listwise-set, and (d) is listwise-rank. Prompts in green blocks are relevance prompts, where (e) is listwise-set and (f) is listwise-rank. Prompts in gray blocks are QA prompts, where (g) is designed for the FQA datasets and (h) is designed for the NFQA dataset.
  • Figure 3: Given 10 candidate passages, ChatGPT is employed to select evidence by utility and by relevance, respectively, using the listwise-set approach. The selected evidence is then used by ChatGPT to answer the questions. In the examples from NQ and MSMARCO-QA, "Passage-9" and "Passage-3" respectively denote the ground-truth supporting evidence. The full set of 10 candidate passages for each question can be accessed at https://github.com/ict-bigdatalab/utility_judgments.
  • Figure 4: Performance of ChatGPT with different input forms on the NQ and MSMARCO-QA datasets under different input orderings of the question and passages. We use the "F1" score for the pointwise and listwise-set forms and the "NDCG@1" score for the pairwise and listwise-rank forms. "NDCG@5" follows the same trend as "NDCG@1".
  • Figure 5: The performance of five different LLMs in utility judgments based on the positions of ground-truth evidence in the input list, across the listwise-set and listwise-rank forms on different datasets.
  • ...and 1 more figure
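
The Figure 4 caption names two metrics: F1 for forms that output a set of passages, and NDCG@k for forms that output a ranking. As a rough illustration, the sketch below computes both under the assumption of binary utility labels (a passage either is ground-truth evidence or is not). The function names `set_f1` and `ndcg_at_k` are hypothetical, and the authors' evaluation code may differ in detail.

```python
import math
from typing import Sequence, Set


def set_f1(selected: Set[int], gold: Set[int]) -> float:
    """F1 between the selected passage ids and the ground-truth ids."""
    true_pos = len(selected & gold)
    if true_pos == 0:
        return 0.0
    precision = true_pos / len(selected)
    recall = true_pos / len(gold)
    return 2 * precision * recall / (precision + recall)


def ndcg_at_k(ranking: Sequence[int], gold: Set[int], k: int) -> float:
    """NDCG@k with binary gains: 1 for ground-truth passages, 0 otherwise."""
    dcg = sum(
        1.0 / math.log2(rank + 2)  # rank is 0-based, so the discount is log2(rank + 2)
        for rank, pid in enumerate(ranking[:k])
        if pid in gold
    )
    ideal_hits = min(k, len(gold))
    idcg = sum(1.0 / math.log2(rank + 2) for rank in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0


if __name__ == "__main__":
    gold = {9}  # e.g. "Passage-9" is the only ground-truth evidence
    print(set_f1({9, 2}, gold))           # set-style output with one extra passage
    print(ndcg_at_k([9, 3, 1], gold, 1))  # ground truth ranked first
    print(ndcg_at_k([3, 9, 1], gold, 1))  # ground truth ranked second
```

With a single ground-truth passage, NDCG@1 is 1 only when that passage is ranked first, which is why it is a strict measure for the pairwise and listwise-rank forms.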