Understanding the Ranking Loss for Recommendation with Sparse User Feedback

Zhutian Lin; Junwei Pan; Shangyu Zhang; Ximei Wang; Xi Xiao; Shudong Huang; Lei Xiao; Jie Jiang

TL;DR

This work identifies a problem with binary cross entropy (BCE) optimization in CTR prediction: when positive feedback is sparse, the gradients for negative samples vanish. Adding an auxiliary ranking loss (as in Combined-Pair) produces larger gradients on negative samples, which improves both classification ability and ranking performance. The authors present theoretical gradient analyses, offline experiments on synthetically sparsified data, and online deployments at Tencent that show consistent improvements in BCE loss, AUC, and GMV across multiple scenarios, including new ads. They further examine stability and compatibility, and extend the approach to variants such as Combined-Contrastive, Focal Loss, and other ranking-loss combinations, indicating that the benefits carry over broadly to online advertising systems.

Abstract

Click-through rate (CTR) prediction is a crucial area of research in online advertising. While binary cross entropy (BCE) has been widely used as the optimization objective when CTR prediction is treated as a binary classification problem, recent work has shown that combining the BCE loss with an auxiliary ranking loss can significantly improve performance. However, why this combined loss is effective is not yet fully understood. In this paper, we uncover a new challenge associated with the BCE loss in scenarios where positive feedback is sparse: gradient vanishing for negative samples. We introduce a novel perspective on the effectiveness of the auxiliary ranking loss in CTR prediction: it generates larger gradients on negative samples, thereby mitigating the optimization difficulties that arise when using the BCE loss alone and improving classification ability. To validate our perspective, we conduct theoretical analysis and extensive empirical evaluations on public datasets. Additionally, we successfully integrate the ranking loss into Tencent's online advertising system, achieving notable lifts of 0.70% and 1.26% in Gross Merchandise Value (GMV) for two main scenarios. The code is openly accessible at: https://github.com/SkylerLinn/Understanding-the-Ranking-Loss.
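
The gradient argument in the abstract can be illustrated with a small numerical sketch. It is not the authors' implementation. It assumes the BCE loss is averaged over all samples, that the ranking term is a RankNet-style pairwise loss averaged over all positive-negative pairs, and that the two are mixed as $\alpha \cdot \text{BCE} + (1-\alpha) \cdot \text{RankNet}$, following the weighting described in the Figure 5 caption. The function names and the toy batch (10 positives, 990 negatives, every logit initialised to a predicted CTR of 0.01) are hypothetical.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def bce_neg_grad(z_neg, n_total):
    """Gradient of the mean BCE loss w.r.t. the logit of one negative sample.

    For a negative sample (label 0), d/dz of -log(1 - sigmoid(z)) is sigmoid(z);
    the 1/n_total factor comes from averaging over the batch (assumed).
    """
    return sigmoid(z_neg) / n_total


def ranknet_neg_grad(z_neg, pos_logits, n_neg):
    """Gradient of a pairwise RankNet loss w.r.t. the logit of one negative sample.

    Loss per pair is -log(sigmoid(z_pos - z_neg)); its derivative w.r.t. z_neg
    is sigmoid(z_neg - z_pos). Averaging over all pos-neg pairs is an assumption.
    """
    n_pairs = len(pos_logits) * n_neg
    return sum(sigmoid(z_neg - z_pos) for z_pos in pos_logits) / n_pairs


def combined_neg_grad(z_neg, pos_logits, n_neg, alpha):
    """Gradient of alpha * BCE + (1 - alpha) * RankNet for one negative logit."""
    n_total = len(pos_logits) + n_neg
    return (alpha * bce_neg_grad(z_neg, n_total)
            + (1.0 - alpha) * ranknet_neg_grad(z_neg, pos_logits, n_neg))


# Toy batch (hypothetical numbers): sparse positives, model not yet separating classes.
n_pos, n_neg = 10, 990
z0 = math.log(0.01 / 0.99)      # logit whose sigmoid is 0.01
pos_logits = [z0] * n_pos

g_bce = bce_neg_grad(z0, n_pos + n_neg)
g_rank = ranknet_neg_grad(z0, pos_logits, n_neg)
g_combined = combined_neg_grad(z0, pos_logits, n_neg, alpha=0.5)

print(f"BCE only      : {g_bce:.3e}")
print(f"RankNet only  : {g_rank:.3e}")
print(f"Combined (0.5): {g_combined:.3e}")
```

In this toy setting the BCE gradient on a negative logit scales with the predicted probability, which is small when positives are rare, while the pairwise term scales with $\sigma(z_\text{neg} - z_\text{pos})$, which stays near 0.5 until positives and negatives are separated. The combined gradient is therefore larger than the BCE-only gradient here; the exact ratio depends on the assumed normalisation and on $\alpha$.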

Paper Structure (30 sections, 16 equations, 9 figures, 5 tables)

Introduction
Related Work
Click-Through Rate Prediction
Binary Cross Entropy
Learning to Rank
Tailor Ranking Loss into CTR Prediction
Auxiliary Ranking Loss in CTR Prediction: A Gradient Vanishing Perspective
Investigation of Classification Ability
Gradients Analysis
Gradients of BCE Loss for Negative Samples
Gradients of BCE Loss for Positive Samples
Gradients of Combined-Pair for Negative Samples
Empirical Analysis of Gradient Vanishing
Experiments
Dataset and Experimental Setting
...and 15 more sections

Figures (9)

Figure 1: (a) BCE loss dynamics over epochs on the training and validation sets and (b) loss landscape of the BCE method (blue) and Combined-Pair (red).
Figure 2: Gradient norm dynamics of negative-sample logits for the BCE method compared with BCE-pairwise ranking combination methods (left) and BCE-listwise ranking combination methods (right) on the Criteo dataset in the first training epoch. All methods set $\alpha=0.5$ in both plots.
Figure 3: Gradient norm dynamics of DNN (left) and CrossNet (right) in DCN V2 through the training process. pct and avg. are short for percentile and average.
Figure 4: Performance evaluation of Combined-Pair and the BCE method under varying positive sparsity rates. Here $\beta_\text{pos}$ denotes the weight of positive samples.
Figure 5: AUC and negative BCE loss of Combined-Pair and the BCE method on the Criteo test set. $\alpha$ and $1-\alpha$ are the weights of the BCE loss and the RankNet loss within Combined-Pair, respectively. The red diamond represents the BCE method, i.e., $\alpha=1$. We show results with $\alpha$ ranging over $[0.1,1]$.
...and 4 more figures
