Large language models can accurately predict searcher preferences

Paul Thomas; Seth Spielman; Nick Craswell; Bhaskar Mitra

TL;DR

The paper addresses the challenge of obtaining high-quality relevance labels at scale by using first-party feedback to develop prompts that guide large language models (LLMs) in labelling. Through experiments on TREC-Robust and deployment insights from Bing, the authors demonstrate that carefully designed prompts can yield LLM labels that rival or exceed those of human assessors in accuracy and ranking fidelity while substantially reducing cost and time. They show that prompt content (aspects, narrative) and even paraphrase variation significantly influence performance, and they establish a monitoring framework to ensure label quality in production. The work argues for adopting LLM-based labelling as a scalable, ground-truth-aligned approach for evaluating and training rankers, while acknowledging biases, safety concerns, and the need for continual validation and oversight.

Abstract

Relevance labels, which indicate whether a search result is valuable to a searcher, are key to evaluating and optimising search systems. The best way to capture the true preferences of users is to ask them for their careful feedback on which results would be useful, but this approach does not scale to produce a large number of labels. Getting relevance labels at scale is usually done with third-party labellers, who judge on behalf of the user, but there is a risk of low-quality data if the labeller doesn't understand user needs. To improve quality, one standard approach is to study real users through interviews, user studies and direct feedback, find areas where labels systematically disagree with users, then educate labellers about user needs through judging guidelines, training and monitoring. This paper introduces an alternate approach for improving label quality. It takes careful feedback from real users, which by definition is the highest-quality first-party gold data that can be derived, and develops a large language model prompt that agrees with that data. We present ideas and observations from deploying language models for large-scale relevance labelling at Bing, and illustrate with data from TREC. We have found large language models can be effective, with accuracy as good as human labellers and similar capability to pick the hardest queries, best runs, and best groups. Systematic changes to the prompts make a difference in accuracy, but so too do simple paraphrases. Measuring agreement with real searchers requires high-quality "gold" labels, but with these we find that models produce better labels than third-party workers, for a fraction of the cost, and these labels let us train notably better rankers.

Paper Structure (27 sections, 4 figures, 5 tables)

Labelling relevance
Labelling relevance with an LLM
Machinery and data
Prompting
Evaluating the labels
Document labels
Document preference
Query ordering
System ordering
Ground-truth preferences between results
Other criteria
Results
Comparing scores
Effect of prompt features
Effect of paraphrasing prompts
...and 12 more sections

Figures (4)

Figure 1: Labelling options discussed in this work, along with the cost and accuracy we see at Bing. A traditional approach uses gold and silver labels to improve crowd workers; we use gold labels to select LLMs and prompts.
Figure 2: General form of the prompts used in our TREC Robust experiments. Italicised words are placeholders, filled with appropriate values. Shaded text is optional, included in some prompt variants.
Figure 3: Examples of paraphrased prompts (extracts), based on prompt format "-DNA-" (description, narrative, and aspects).
Figure 4: Variation in Cohen's $\kappa$ between LLM labels and human labels, over a stratified sample of 3000 documents from TREC-Robust, as we paraphrase the prompt.
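Cohen's $\kappa$, the agreement statistic named in the Figure 4 caption, corrects raw agreement between two labellers for the agreement expected by chance. The following is a minimal sketch of that calculation, not the authors' evaluation code: the function name `cohens_kappa` and the toy binary labels are assumptions made for illustration, and no data from the paper is used.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two equal-length sequences of categorical labels.

    Hypothetical helper for illustration; not taken from the paper.
    """
    if len(labels_a) != len(labels_b) or len(labels_a) == 0:
        raise ValueError("label sequences must be non-empty and of equal length")
    n = len(labels_a)

    # Observed agreement: fraction of items given the same label by both raters.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement: for each label, the product of the two raters' marginal rates.
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    p_e = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)

    if p_e == 1.0:
        # Both raters used one identical label throughout; kappa is undefined.
        return float("nan")
    return (p_o - p_e) / (1.0 - p_e)


# Toy example (made-up labels): 1 = relevant, 0 = not relevant.
human = [1, 0, 1, 1, 0, 0, 1, 0]
llm = [1, 0, 1, 0, 0, 0, 1, 1]
kappa = cohens_kappa(human, llm)
print(f"kappa = {kappa:.2f}")
```

In this toy case the two label sets agree on 6 of 8 items (observed agreement 0.75) while chance agreement is 0.5, so $\kappa = (0.75 - 0.5) / (1 - 0.5) = 0.5$. Repeating such a calculation for each paraphrased prompt would give one $\kappa$ per prompt variant, which is the kind of spread the Figure 4 caption describes.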
