Understanding the Effects of the Baidu-ULTR Logging Policy on Two-Tower Models

Morris de Haan, Philipp Hager

TL;DR

The paper investigates whether logging-policy confounding undermines two-tower ULTR models on a real-world dataset, Baidu-ULTR. It approximates the logging policy with LambdaMART, finds that the approximation aligns strongly with the logged rankings, and concludes that confounding could therefore exist in principle. However, when two-tower models with and without the proposed debiasing methods (dropout and backdoor adjustment) are evaluated on expert-annotated data, none of the debiasing methods improves ranking performance over a naive two-tower setup, and some slightly reduce it; expert-guided models still outperform all click-based models. The findings challenge the practical relevance of logging-policy confounding for two-tower ULTR on Baidu-ULTR. They suggest that the gap between click-only and expert-guided models is driven more by data quality and annotation reliability than by confounding alone. Identifiability and distributional shift are named as future directions.

Abstract

Despite the popularity of the two-tower model for unbiased learning to rank (ULTR) tasks, recent work suggests that it suffers from a major limitation that could lead to its collapse in industry applications: the problem of logging policy confounding. Several potential solutions have been proposed; however, these methods were mostly evaluated in semi-synthetic simulation experiments. This paper bridges the gap between theory and practice by investigating the confounding problem on the largest real-world dataset, Baidu-ULTR. Our main contributions are threefold: 1) we show that the conditions for the confounding problem are given on Baidu-ULTR, 2) we find that the confounding problem has no significant effect on the two-tower model, and 3) we point to a potential mismatch between expert annotations, the gold standard in ULTR, and user click behavior.
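
For readers new to the setup, the following is a minimal sketch of an additive two-tower click model as it is commonly described in the ULTR literature. The additive form, the function names, and the example logits are assumptions made for illustration; this summary does not specify the exact architecture or training procedure used in the paper.

```python
import math


def sigmoid(z: float) -> float:
    """Logistic function mapping a logit to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


def two_tower_click_prob(relevance_logit: float, bias_logit: float) -> float:
    """Assumed additive two-tower form: one tower scores relevance from
    query-document features, the other scores bias from presentation
    features such as position. The two logits are summed before the sigmoid."""
    return sigmoid(relevance_logit + bias_logit)


def rank_by_relevance(relevance_logits: list[float]) -> list[int]:
    """At inference time only the relevance tower is used: return document
    indices sorted by descending relevance logit."""
    return sorted(
        range(len(relevance_logits)),
        key=lambda i: relevance_logits[i],
        reverse=True,
    )


# Hypothetical logits for three documents shown at positions 1 to 3.
relevance = [0.1, 2.0, -1.0]
position_bias = [1.0, 0.0, -1.0]
click_probs = [two_tower_click_prob(r, b) for r, b in zip(relevance, position_bias)]
ranking = rank_by_relevance(relevance)
```

In this sketch the confounding concern arises when the position a document was shown at is itself correlated with its relevance, because the logging policy already ranked relevant documents higher, so the two towers may not be cleanly separable from clicks alone.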

Paper Structure (11 sections, 2 equations, 1 figure, 2 tables)

Figures (1)

  • Figure 1: Comparing ranking performance on Baidu-ULTR when test queries are binned by the performance of the approximated logging policy.
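
The kind of binned comparison described in the Figure 1 caption can be sketched as follows. The equal-sized bins, the function name, and the example numbers are assumptions for illustration only; the caption does not state how the paper forms its bins or which metric it reports.

```python
from statistics import mean


def mean_metric_per_bin(policy_metric, model_metric, n_bins=3):
    """Sort queries by the (approximated) logging-policy metric, split them
    into n_bins roughly equal-sized groups, and return the mean model metric
    in each group, ordered from lowest to highest policy performance."""
    order = sorted(range(len(policy_metric)), key=lambda q: policy_metric[q])
    n = len(order)
    results = []
    for b in range(n_bins):
        members = order[b * n // n_bins:(b + 1) * n // n_bins]
        if members:  # skip empty bins when there are fewer queries than bins
            results.append(mean(model_metric[q] for q in members))
    return results


# Hypothetical per-query scores for four test queries.
policy_scores = [0.1, 0.9, 0.5, 0.3]
model_scores = [1.0, 4.0, 3.0, 2.0]
binned = mean_metric_per_bin(policy_scores, model_scores, n_bins=2)
```

Comparing these per-bin means across models shows whether a model's ranking quality changes with how well the logging policy performed on those queries.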