Understanding the Role of Cross-Entropy Loss in Fairly Evaluating Large Language Model-based Recommendation
Cong Xu, Zhangchi Zhu, Jun Wang, Jianyong Wang, Wei Zhang
TL;DR
This paper examines whether LLM-based recommenders have been evaluated fairly, focusing on the cross-entropy (CE) loss. It shows that CE with a full softmax bounds ranking metrics such as NDCG and RR, which explains why optimizing CE improves ranking. It then develops practical CE-like approximations, Noise Contrastive Estimation (NCE) and Scaled Cross-Entropy (SCE), whose theoretical bounds depend on the current rank $r_+$ of the target item, so that training remains effective with a small set of sampled negatives. Empirically, conventional models trained with CE outperform recent LLM-based methods on next-item recommendation across Beauty, MovieLens-1M, and Yelp, while NCE and SCE reach comparable performance with far fewer negative samples; claims about LLMs therefore appear over-optimistic once fair baselines are used. The work offers guidelines for objective evaluation and practical training choices (e.g., $c \approx 10$, $\alpha \approx 100$, modest $K$) for approximating CE, supporting fair comparisons and a clearer picture of the true ranking capability of LLM-based recommender systems.
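The relationship between full-softmax CE and the ranking metrics can be checked numerically. The sketch below is illustrative only and is not taken from the paper: it assumes a single target item per example (the next-item setting), so that $\mathrm{NDCG} = 1/\log_2(1 + r_+)$ and $\mathrm{RR} = 1/r_+$, and it draws random scores in place of a trained model. The function names (`target_rank`, `full_softmax_ce`, and so on) are hypothetical. Under these assumptions, $-\log \mathrm{NDCG} \le -\log \mathrm{RR} \le \ell_{\mathrm{CE}}$ should hold for every example, because each item scored at or above the target contributes at least 1 to the softmax normalizer.

```python
import numpy as np

def target_rank(scores, target):
    """Rank r_+ of the target item (1 = best); ties count against the target."""
    others = np.delete(scores, target)
    return 1 + int(np.sum(others >= scores[target]))

def full_softmax_ce(scores, target):
    """Cross-entropy of the target under a softmax over ALL items."""
    shifted = scores - np.max(scores)  # subtract the max for numerical stability
    return float(np.log(np.sum(np.exp(shifted))) - shifted[target])

def neg_log_ndcg(rank):
    """-log NDCG for a single relevant item, assuming NDCG = 1 / log2(1 + r_+)."""
    return float(np.log(np.log2(1 + rank)))

def neg_log_rr(rank):
    """-log RR for a single relevant item, assuming RR = 1 / r_+."""
    return float(np.log(rank))

# Random scores stand in for model outputs (an assumption for illustration).
rng = np.random.default_rng(0)
results = []
for _ in range(1000):
    scores = rng.normal(size=50)      # 50 candidate items
    target = int(rng.integers(50))    # index of the ground-truth next item
    r = target_rank(scores, target)
    results.append((neg_log_ndcg(r), neg_log_rr(r), full_softmax_ce(scores, target)))

# Check -log NDCG <= -log RR <= CE on every sampled example.
bounds_hold = all(
    ndcg_term <= rr_term + 1e-12 and rr_term <= ce + 1e-12
    for ndcg_term, rr_term, ce in results
)
print("bounds hold on all examples:", bounds_hold)
```

Lowering CE therefore tightens an upper bound on $-\log \mathrm{NDCG}$ and $-\log \mathrm{RR}$, which is one way to read the claim that optimizing CE improves ranking. The check says nothing about how tight the bound is for a trained model.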
Abstract
Large language models (LLMs) have gained much attention in the recommendation community, and some studies have observed that LLMs fine-tuned with the cross-entropy loss and a full softmax can already achieve state-of-the-art performance. However, these claims are drawn from comparisons that are neither objective nor fair. Because the number of items in practice is very large, conventional recommenders typically adopt a pointwise or pairwise loss function for training instead. This substitution, however, causes severe performance degradation, leading to an underestimation of conventional methods and overconfidence in the ranking capability of LLMs. In this work, we theoretically justify the superiority of cross-entropy and show that it can be adequately replaced by some elementary approximations with certain necessary modifications. The results across three public datasets corroborate that, even in a practical sense, existing LLM-based methods are not as effective as claimed for next-item recommendation. We hope that these theoretical insights, together with the empirical results, will facilitate an objective evaluation of LLM-based recommendation in the future.
