A Theoretical Analysis of Recommendation Loss Functions under Negative Sampling

Giulia Di Teodoro, Federico Siciliano, Nicola Tonellotto, Fabrizio Silvestri

TL;DR

This work provides a formal connection between BCE, CCE, and BPR losses and ranking metrics in recommender systems. It proves that, with the full set of negatives, CCE yields the tightest bound on ranking metrics such as NDCG and MRR, with BPR and BCE following, and that BPR and CCE become equivalent under single negative sampling while all three converge to a common global minimum when scores are bounded. Under sampling, the bounds become probabilistic, and the authors derive hypergeometric-based expressions to quantify these bounds, showing BCE often offers the strongest worst-case bound while CCE remains robust across settings. Experimental results across five datasets and multiple models validate the theory, revealing nuanced trade-offs: more negatives generally improve end performance, CCE tends to be stable across datasets, and BCE can outperform others in late training under certain conditions. The findings inform loss-function choice and sampling strategies in large-scale recommender systems and motivate future work on diverse sampling schemes and multi-user scenarios.

Abstract

Loss functions like Categorical Cross Entropy (CCE), Binary Cross Entropy (BCE), and Bayesian Personalized Ranking (BPR) are commonly used in training Recommender Systems (RSs) to distinguish positive items (those a user has interacted with) from negative items. While prior work has shown empirically that CCE outperforms BCE and BPR when using the full set of negative items, we provide a theoretical explanation for this by proving that CCE offers the tightest lower bound on ranking metrics like Normalized Discounted Cumulative Gain (NDCG) and Mean Reciprocal Rank (MRR), followed by BPR and BCE. However, using the full set of negative items is computationally infeasible for large-scale RSs, prompting the use of negative sampling techniques. Under negative sampling, we reveal that BPR and CCE are equivalent when a single negative sample is drawn, and all three losses converge to the same global minimum. We further demonstrate that the sampled losses remain lower bounds for NDCG (MRR), albeit in a probabilistic sense. Our worst-case analysis shows that BCE offers the strongest bound on NDCG (MRR). Experiments on five datasets and four models empirically support these theoretical findings. Our code and supplementary material are available at https://github.com/federicosiciliano/recsys_losses.git.
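
The single-negative equivalence of BPR and CCE can be checked directly. With one positive score $s_+$ and one negative score $s_-$, the softmax cross entropy is $\log(1 + e^{s_- - s_+})$, which is the same quantity as $-\log \sigma(s_+ - s_-)$. The following is a minimal sketch, assuming the standard softmax form of CCE and the pairwise logistic form of BPR; the function names `cce_single_negative` and `bpr_single_negative` are illustrative and are not taken from the authors' repository.

```python
import math


def sigmoid(x: float) -> float:
    """Logistic function."""
    return 1.0 / (1.0 + math.exp(-x))


def cce_single_negative(s_pos: float, s_neg: float) -> float:
    """Softmax cross entropy over {positive, one negative} (assumed form)."""
    m = max(s_pos, s_neg)  # shift by the max for numerical stability
    log_z = m + math.log(math.exp(s_pos - m) + math.exp(s_neg - m))
    return log_z - s_pos


def bpr_single_negative(s_pos: float, s_neg: float) -> float:
    """Pairwise logistic (BPR) loss for one negative (assumed form)."""
    return -math.log(sigmoid(s_pos - s_neg))


if __name__ == "__main__":
    # Arbitrary illustrative scores, not data from the paper.
    for s_pos, s_neg in [(1.3, -0.4), (-0.7, 2.1), (0.0, 0.0)]:
        print(s_pos, s_neg,
              cce_single_negative(s_pos, s_neg),
              bpr_single_negative(s_pos, s_neg))
```

The two values agree up to floating-point error for any pair of scores, since both reduce to $\log(1 + e^{s_- - s_+})$.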

Paper Structure

This paper contains 27 sections, 22 theorems, 73 equations, 36 figures, 1 table.

Key Result

Theorem 1

Let $\ell_{CCE}$, $\ell_{BPR}$, and $\ell_{BCE}$ denote the full forms of the losses. Then the following inequalities hold: and, if $s_+ \geq 0$, we further have:
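
An ordering of this kind (CCE tightest, then BPR, then BCE, each bounding $-\log$ of the ranking metric) can be sanity-checked numerically. The sketch below is an illustration under stated assumptions, not the paper's statement of the theorem: it assumes one positive item, losses summed over all negatives, uncut NDCG equal to $1/\log_2(1 + r_+)$ and MRR equal to $1/r_+$ for positive rank $r_+$, and $s_+ \geq 0$. The helper names (`full_losses`, `neg_log_metrics`, `count_violations`) are hypothetical.

```python
import math
import random


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def full_losses(s_pos, s_negs):
    """Return (CCE, BPR, BCE) under the assumed sum-over-negatives forms."""
    cce = math.log1p(sum(math.exp(s - s_pos) for s in s_negs))
    bpr = sum(softplus(s - s_pos) for s in s_negs)
    bce = softplus(-s_pos) + sum(softplus(s) for s in s_negs)
    return cce, bpr, bce


def neg_log_metrics(s_pos, s_negs):
    """Return (-log NDCG, -log MRR) for a single positive item."""
    rank = 1 + sum(s > s_pos for s in s_negs)
    return math.log(math.log2(1 + rank)), math.log(rank)


def count_violations(trials=1000, n_negs=20, seed=0):
    """Count random draws where the assumed chain of inequalities fails."""
    rng = random.Random(seed)
    violations = 0
    for _ in range(trials):
        s_pos = rng.uniform(0.0, 3.0)  # assumption: s_+ >= 0
        s_negs = [rng.uniform(-3.0, 3.0) for _ in range(n_negs)]
        cce, bpr, bce = full_losses(s_pos, s_negs)
        nl_ndcg, nl_mrr = neg_log_metrics(s_pos, s_negs)
        if not (nl_ndcg <= nl_mrr <= cce <= bpr <= bce):
            violations += 1
    return violations


if __name__ == "__main__":
    print("violations:", count_violations())
```

Under these assumed definitions the chain follows from elementary facts: $r_+ \leq 1 + \sum_n e^{s_n - s_+}$, $\prod_n (1 + a_n) \geq 1 + \sum_n a_n$ for $a_n \geq 0$, and $e^{s_n - s_+} \leq e^{s_n}$ when $s_+ \geq 0$. If the paper normalizes the losses differently (for example, averaging over negatives), the comparison would need to be adapted.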

Figures (36)

  • Figure 1: GRU4Rec NDCG@10 during training on the ML-1M dataset, for varying numbers of negative items and different losses.
  • Figure 2: SASRec NDCG@10 during training on the ML-1M dataset, for varying numbers of negative items and different losses.
  • Figure 3: SASRec and GRU4Rec NDCG@10 during training on the ML-1M dataset, using 1 negative item and different losses.
  • Figure 4: SASRec and GRU4Rec NDCG@10 during training on the ML-1M dataset, using 100 negative items and different losses.
  • Figure 5: SASRec and GRU4Rec NDCG@10 during training on the Foursquare dataset, using 100 negative items and different losses.
  • ...and 31 more figures

Theorems & Definitions (39)

  • Theorem 1
  • proof
  • Proposition 1
  • Proposition 2
  • Lemma 1
  • proof
  • Lemma 2
  • Theorem 2
  • proof
  • Theorem 3
  • ...and 29 more