Can LLM Annotations Replace User Clicks for Learning to Rank?

Lulu Yu, Keping Bi, Jiafeng Guo, Shihao Liu, Shuaiqiang Wang, Dawei Yin, Xueqi Cheng

TL;DR

The paper investigates whether LLM-based relevance annotations can replace click data for learning to rank (LTR). It shows that click-supervised models better capture document-level signals and perform strongly on high-frequency queries, while models supervised by LLM annotations excel at semantic matching and fare better on medium- and low-frequency queries; the two supervision sources are complementary. Two hybrid training strategies, data scheduling and frequency-aware multi-objective learning (FAMOL), fuse both supervision signals, with FAMOL delivering the most consistent gains across query frequencies. The findings provide practical guidance for leveraging both supervision sources in production ranking systems and highlight the value of semantic and document-level cues in LTR.

Abstract

Large-scale supervised data is essential for training modern ranking models, but obtaining high-quality human annotations is costly. Click data has been widely used as a low-cost alternative, and with recent advances in large language models (LLMs), LLM-based relevance annotation has emerged as another promising source of supervision. This paper investigates whether LLM annotations can replace click data for learning to rank (LTR) by conducting a comprehensive comparison across multiple dimensions. Experiments on both a public dataset, TianGong-ST, and an industrial dataset, Baidu-Click, show that click-supervised models perform better on high-frequency queries, while LLM annotation-supervised models are more effective on medium- and low-frequency queries. Further analysis shows that click-supervised models are better at capturing document-level signals such as authority or quality, while LLM annotation-supervised models are more effective at modeling semantic matching between queries and documents and at distinguishing relevant from non-relevant documents. Motivated by these observations, we explore two training strategies -- data scheduling and frequency-aware multi-objective learning -- that integrate both supervision signals. Both approaches enhance ranking performance across queries at all frequency levels, with the latter being more effective. Our code is available at https://github.com/Trustworthy-Information-Access/LLMAnn_Click.

Paper Structure

This paper contains 24 sections, 5 equations, 7 figures, 5 tables.

Figures (7)

  • Figure 1: Comparison of annotation characteristics between click data and LLM annotations.
  • Figure 2: Annotation strategies with LLMs: (a) PointAnn, (b) ListAnn, (c) ListRank and (d) ListSel.
  • Figure 3: Distribution of PointAnn, ListAnn, and human annotations at various relevance levels.
  • Figure 4: Effects of document-level and query-document-level LTR features on models.
  • Figure 5: Effects of true negatives in training and test sets on model performance.
  • ...and 2 more figures