Large language models can accurately predict searcher preferences
Paul Thomas, Seth Spielman, Nick Craswell, Bhaskar Mitra
TL;DR
The paper addresses the challenge of obtaining high-quality relevance labels at scale by using first-party feedback to develop prompts that guide large language models (LLMs) in labeling. Through experiments on TREC-Robust and deployment insights from Bing, the authors demonstrate that carefully designed prompts can yield LLM labels that rival or exceed those of human assessors in accuracy and ranking fidelity while substantially reducing cost and time. They show that prompt content (aspects, narrative) and even paraphrase variation significantly influence performance, and they establish a monitoring framework to ensure label quality in production. The work argues for adopting LLM-based labeling as a scalable, ground-truth-aligned approach for evaluating and training rankers, while acknowledging biases, safety concerns, and the need for continual validation and oversight.
Abstract
Relevance labels, which indicate whether a search result is valuable to a searcher, are key to evaluating and optimising search systems. The best way to capture the true preferences of users is to ask them for their careful feedback on which results would be useful, but this approach does not scale to produce a large number of labels. Getting relevance labels at scale is usually done with third-party labellers, who judge on behalf of the user, but there is a risk of low-quality data if the labeller doesn't understand user needs. To improve quality, one standard approach is to study real users through interviews, user studies and direct feedback, find areas where labels systematically disagree with users, then educate labellers about user needs through judging guidelines, training and monitoring. This paper introduces an alternative approach for improving label quality. It takes careful feedback from real users, which by definition is the highest-quality first-party gold data that can be derived, and develops a large language model prompt that agrees with that data. We present ideas and observations from deploying language models for large-scale relevance labelling at Bing, and illustrate with data from TREC. We have found large language models can be effective, with accuracy as good as human labellers and similar capability to pick the hardest queries, best runs, and best groups. Systematic changes to the prompts make a difference in accuracy, but so too do simple paraphrases. Measuring agreement with real searchers requires high-quality "gold" labels, but with these we find that models produce better labels than third-party workers, for a fraction of the cost, and these labels let us train notably better rankers.
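As an illustration of what "agreement with gold labels" can mean in practice, the following minimal sketch compares two hypothetical label sources against first-party gold labels using raw accuracy and Cohen's kappa (agreement corrected for chance). The function names and the toy binary labels are assumptions made for this example only; they are not data or code from the paper, and the abstract does not say which agreement statistic the authors use.

```python
from collections import Counter


def accuracy(gold, predicted):
    """Fraction of items where the predicted label equals the gold label."""
    if len(gold) != len(predicted) or not gold:
        raise ValueError("label lists must be non-empty and of equal length")
    return sum(g == p for g, p in zip(gold, predicted)) / len(gold)


def cohens_kappa(gold, predicted):
    """Chance-corrected agreement between two label lists."""
    n = len(gold)
    observed = accuracy(gold, predicted)
    gold_counts = Counter(gold)
    pred_counts = Counter(predicted)
    # Agreement expected if both sources labelled independently
    # with their own marginal label frequencies.
    expected = sum(
        gold_counts[c] * pred_counts[c] for c in set(gold) | set(predicted)
    ) / (n * n)
    if expected == 1.0:
        return 1.0  # both sources used a single identical label throughout
    return (observed - expected) / (1.0 - expected)


# Toy data, invented for illustration: 1 = relevant, 0 = not relevant.
gold_labels = [1, 1, 0, 0, 1, 0, 1, 0]
model_labels = [1, 1, 0, 0, 1, 0, 0, 0]
third_party_labels = [1, 0, 0, 1, 1, 0, 0, 0]

for name, labels in [("model", model_labels), ("third party", third_party_labels)]:
    print(name, accuracy(gold_labels, labels), cohens_kappa(gold_labels, labels))
```

Kappa is shown alongside accuracy because relevance labels are often imbalanced, and a labeller who always answers "not relevant" can score high accuracy while agreeing with searchers no better than chance.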
