Unbiased Learning to Rank Meets Reality: Lessons from Baidu's Large-Scale Search Dataset

Philipp Hager, Romain Deffayet, Jean-Michel Renders, Onno Zoeter, Maarten de Rijke

TL;DR

This paper critically evaluates unbiased learning-to-rank (ULTR) methods on Baidu-ULTR, the largest real-world dataset with click logs and expert annotations. It compares several ULTR losses (pointwise/listwise) and two modeling paradigms (transformer-based cross-encoders and traditional rerankers) using a large reranking subset and a rigorously tuned experimental setup. The key finding is that ULTR yields little to no robust improvement in ranking performance on expert annotations, even though click prediction quality often improves, and that improvements largely depend on representation and loss choice rather than debiasing alone. The work highlights a notable divergence between click-based objectives and expert-annotated ranking, calls for more realistic evaluation frameworks, and suggests that ULTR theory may not readily translate to practice on real-world large-scale search systems.

Abstract

Unbiased learning-to-rank (ULTR) is a well-established framework for learning from user clicks, which are often biased by the ranker collecting the data. While theoretically justified and extensively tested in simulation, ULTR techniques lack empirical validation, especially on modern search engines. The Baidu-ULTR dataset released for the WSDM Cup 2023, collected from Baidu's search engine, offers a rare opportunity to assess the real-world performance of prominent ULTR techniques. Despite multiple submissions during the WSDM Cup 2023 and the subsequent NTCIR ULTRE-2 task, it remains unclear whether the observed improvements stem from applying ULTR or other learning techniques. In this work, we revisit and extend the available experiments on the Baidu-ULTR dataset. We find that standard unbiased learning-to-rank techniques robustly improve click predictions but struggle to consistently improve ranking performance, especially considering the stark differences obtained by choice of ranking loss and query-document features. Our experiments reveal that gains in click prediction do not necessarily translate to enhanced ranking performance on expert relevance annotations, implying that conclusions strongly depend on how success is measured in this benchmark.

Paper Structure

This paper contains 25 sections, 7 equations, 4 figures, 3 tables.

Figures (4)

  • Figure 1: The original experiments on Baidu-ULTR show that none of the four compared ULTR methods outperforms a naive method that does not correct for position bias. Data from Zou et al. (2022); visualization by us.
  • Figure 2: Position bias as estimated by RegressionEM and three intervention harvesting methods, compared to the mean CTR. Propensities are normalized by their value at position one.
  • Figure 3: Comparing ULTR methods on pre-trained BERT embeddings and LTR features. We display the average ranking performance measured in DCG@10 over five independent runs and plot a bootstrapped 95% confidence interval. The grey dotted line indicates the performance of a random ranker.
  • Figure 4: Click prediction performance of pointwise methods measured in negative log-likelihood; lower is better.
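
The captions above rely on three quantities: DCG@10 with a bootstrapped 95% confidence interval (Figure 3), negative log-likelihood of clicks (Figure 4), and propensities normalized by position one (Figure 2). The following is a minimal Python sketch of how such quantities are commonly computed, not the authors' implementation. The function names, the exponential gain 2^label - 1, the log2 discount, and the percentile bootstrap are all assumptions made for illustration; the paper's exact formulation may differ.

```python
import math
import random


def dcg_at_k(labels, k=10):
    """DCG@k for relevance labels listed in ranked order (top result first).

    Assumption: exponential gain 2**label - 1 with a log2 position discount.
    """
    return sum((2 ** rel - 1) / math.log2(rank + 2)
               for rank, rel in enumerate(labels[:k]))


def bootstrap_ci(values, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of `values`."""
    rng = random.Random(seed)
    n = len(values)
    # Resample with replacement and record the mean of each resample.
    means = sorted(
        sum(rng.choice(values) for _ in range(n)) / n
        for _ in range(n_resamples)
    )
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


def click_nll(click_probs, clicks, eps=1e-12):
    """Mean negative log-likelihood of binary clicks; lower is better."""
    total = 0.0
    for p, c in zip(click_probs, clicks):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total -= c * math.log(p) + (1 - c) * math.log(1 - p)
    return total / len(clicks)


def normalize_propensities(propensities):
    """Rescale propensities so that position one has value 1.0."""
    return [p / propensities[0] for p in propensities]
```

As a small worked case under these assumptions, `dcg_at_k([3, 0, 1])` evaluates to 7.5: the top document contributes (2^3 - 1) / log2(2) = 7, the second contributes 0, and the third contributes (2^1 - 1) / log2(4) = 0.5.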