A Theoretical Analysis of Recommendation Loss Functions under Negative Sampling
Giulia Di Teodoro, Federico Siciliano, Nicola Tonellotto, Fabrizio Silvestri
TL;DR
This work formally connects the BCE, CCE, and BPR losses to ranking metrics in recommender systems. It proves that, with the full set of negatives, CCE yields the tightest bound on ranking metrics such as NDCG and MRR, followed by BPR and then BCE. It also shows that BPR and CCE are equivalent when a single negative is sampled, and that all three losses converge to a common global minimum when scores are bounded. Under sampling, the bounds hold only probabilistically; the authors derive hypergeometric-based expressions to quantify them, showing that BCE often offers the strongest worst-case bound while CCE remains robust across settings. Experiments on five datasets and four models validate the theory and reveal nuanced trade-offs: more negatives generally improve final performance, CCE tends to be stable across datasets, and BCE can outperform the others late in training under certain conditions. The findings inform the choice of loss function and sampling strategy in large-scale recommender systems and motivate future work on diverse sampling schemes and multi-user scenarios.
Abstract
Loss functions such as Categorical Cross Entropy (CCE), Binary Cross Entropy (BCE), and Bayesian Personalized Ranking (BPR) are commonly used to train Recommender Systems (RSs) to distinguish positive items (those the user has interacted with) from negative items. While prior work has shown empirically that CCE outperforms BCE and BPR when the full set of negative items is used, we provide a theoretical explanation by proving that CCE offers the tightest lower bound on ranking metrics such as Normalized Discounted Cumulative Gain (NDCG) and Mean Reciprocal Rank (MRR), followed by BPR and BCE. However, using the full set of negative items is computationally infeasible for large-scale RSs, which prompts the use of negative sampling. Under negative sampling, we show that BPR and CCE are equivalent when a single negative sample is drawn, and that all three losses converge to the same global minimum. We further demonstrate that the sampled losses remain lower bounds on NDCG (MRR), albeit in a probabilistic sense. Our worst-case analysis shows that BCE offers the strongest bound on NDCG (MRR). Experiments on five datasets and four models empirically support these theoretical findings. Our code and supplementary material are available at https://github.com/federicosiciliano/recsys_losses.git.
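To make the single-negative equivalence between BPR and CCE concrete, the following minimal sketch computes the three losses for one positive score and a list of sampled negative scores. The function names, the unnormalized sums over negatives, and the example scores are assumptions chosen for illustration; they are not taken from the authors' repository, and the paper's exact normalization may differ.

```python
import math


def sigmoid(x):
    """Logistic function."""
    return 1.0 / (1.0 + math.exp(-x))


def bce_loss(pos, negs):
    """Binary cross entropy: positive labelled 1, each sampled negative labelled 0."""
    return -math.log(sigmoid(pos)) - sum(math.log(1.0 - sigmoid(n)) for n in negs)


def cce_loss(pos, negs):
    """Categorical cross entropy: softmax over the positive and the sampled negatives."""
    log_z = math.log(math.exp(pos) + sum(math.exp(n) for n in negs))
    return log_z - pos


def bpr_loss(pos, negs):
    """BPR: pairwise logistic loss, summed over sampled negatives (assumed convention)."""
    return -sum(math.log(sigmoid(pos - n)) for n in negs)


# Hypothetical scores for one user: one positive item and sampled negatives.
pos_score = 2.0
one_negative = [0.5]
two_negatives = [0.5, -1.0]

# With a single negative, CCE reduces to -log(sigmoid(pos - neg)), which is BPR.
cce_1 = cce_loss(pos_score, one_negative)
bpr_1 = bpr_loss(pos_score, one_negative)

# With more than one negative, the two losses generally differ.
cce_2 = cce_loss(pos_score, two_negatives)
bpr_2 = bpr_loss(pos_score, two_negatives)
```

With one negative, `cce_1` and `bpr_1` agree up to floating-point error, because the two-way softmax over the positive and the negative collapses to a sigmoid of the score difference. With two negatives the values diverge, which is where the differences between the bounds discussed in the abstract come into play.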
