Utility-Focused LLM Annotation for Retrieval and Retrieval-Augmented Generation

Hengran Zhang, Minghao Tang, Keping Bi, Jiafeng Guo, Shihao Liu, Daiting Shi, Dawei Yin, Xueqi Cheng

TL;DR

This work tackles the high cost of human relevance annotations by proposing utility-focused LLM annotations to train retrievers for retrieval-augmented generation (RAG). It introduces SumMargLH, a loss that handles multiple positive annotations per query, and evaluates three annotation strategies (RelSel, UtilSel, UtilRank) using Qwen-2.5-32B on MS MARCO and BEIR. Retrievers trained on LLM-annotated utility improve out-of-domain (OOD) retrieval and RAG performance, and combining LLM annotations with a limited number of human labels via curriculum learning can match or exceed the performance obtained with full human labels. The approach offers a scalable way to initialize QA systems on new corpora with lower labeling costs and robust generalization to new domains.

Abstract

This paper explores the use of large language models (LLMs) for annotating document utility in training retrieval and retrieval-augmented generation (RAG) systems, aiming to reduce dependence on costly human annotations. We address the gap between retrieval relevance and generative utility by employing LLMs to annotate document utility. To effectively utilize multiple positive samples per query, we introduce a novel loss that maximizes their summed marginal likelihood. Using the Qwen-2.5-32B model, we annotate utility on the MS MARCO dataset and conduct retrieval experiments on MS MARCO and BEIR, as well as RAG experiments on MS MARCO QA, NQ, and HotpotQA. Our results show that LLM-generated annotations enhance out-of-domain retrieval performance and improve RAG outcomes compared to models trained solely on human annotations or downstream QA metrics. Furthermore, combining LLM annotations with just 20% of human labels achieves performance comparable to using full human annotations. Our study offers a comprehensive approach to utilizing LLM annotations for initializing QA systems on new corpora.
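
The multi-positive loss described above can be illustrated with a minimal sketch. The abstract states only that the loss maximizes the summed marginal likelihood of the positives; the softmax over candidate scores, the function name `sum_marginal_likelihood_loss`, and the list-based inputs below are assumptions made for illustration, not the authors' implementation.

```python
import math


def sum_marginal_likelihood_loss(scores, positive_mask):
    """Negative log of the total softmax probability assigned to all positives.

    scores: query-document similarity scores for one query (assumed inputs).
    positive_mask: booleans marking which candidates are annotated positive.
    """
    if len(scores) != len(positive_mask):
        raise ValueError("scores and positive_mask must have the same length")
    if not any(positive_mask):
        raise ValueError("at least one positive is required")

    # Subtract the max score before exponentiating for numerical stability.
    max_score = max(scores)
    exps = [math.exp(s - max_score) for s in scores]

    total = sum(exps)
    positive_total = sum(e for e, is_pos in zip(exps, positive_mask) if is_pos)

    # Summing the positives inside the log lets the model raise the loss-relevant
    # probability through any positive, rather than forcing every positive up.
    return -math.log(positive_total / total)


# Hypothetical scores for one query with two LLM-annotated positives.
example_scores = [2.0, 1.0, 0.0]
example_mask = [True, True, False]
example_loss = sum_marginal_likelihood_loss(example_scores, example_mask)
```

With a single positive this reduces to the usual softmax cross-entropy used in contrastive retriever training, which is one reason the summed form is a natural extension when an annotator marks several passages as useful for the same query.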

Paper Structure

This paper contains 38 sections, 6 equations, 7 figures, 17 tables.

Figures (7)

  • Figure 1: Different annotation methodologies: (a) Human annotation, (b) Using downstream task performance as utility score, (c) Our utility-focused annotation pipeline. The prompts are illustrative; see the prompts appendix for details.
  • Figure 2: Positive annotation distribution of different annotators at various stages.
  • Figure 3: (a) Retrieval performance (%) with different human annotation ratios in curriculum learning; (b) Annotation quality evaluation (%) and retrieval performance (%) with different thresholds for UtilRank.
  • Figure 4: Relevance-based selection prompt for LLMs.
  • Figure 5: Pseudo-answer generation prompt for LLMs.
  • ...and 2 more figures