Understanding the Role of Cross-Entropy Loss in Fairly Evaluating Large Language Model-based Recommendation

Cong Xu; Zhangchi Zhu; Jun Wang; Jianyong Wang; Wei Zhang

TL;DR

This paper analyzes the fairness of evaluating LLM-based recommendations using cross-entropy (CE) and shows CE with a full softmax corresponds to a bound on ranking metrics such as NDCG and RR, justifying why optimizing CE improves ranking. It then develops practical CE-like approximations—Noise Contrastive Estimation (NCE) and Scaled Cross-Entropy (SCE)—with theoretical bounds that depend on the current target rank $r_+$, enabling effective training with a small sampled set of negatives. Empirically, conventional models trained with CE outperform recent LLM-based methods on next-item recommendation across Beauty, MovieLens-1M, and Yelp, while NCE and SCE can achieve comparable performance with far fewer negative samples, revealing over-optimistic claims about LLMs when fair baselines are used. The work provides objective evaluation guidelines and practical training choices (e.g., $c \approx 10$, $\alpha \approx 100$, modest $K$) to approximate CE, facilitating fair comparisons and better understanding of LLM-based recommender systems' true ranking capabilities.
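The claim that full-softmax CE bounds NDCG and RR can be checked numerically. The sketch below is not the paper's Proposition 1 (which is parameterized by $n \ge r_+$); it assumes a single relevant item, natural logarithms, $\mathrm{NDCG} = 1/\log_2(1 + r_+)$ and $\mathrm{RR} = 1/r_+$, and tests the simpler chain $-\log \mathrm{NDCG} \le -\log \mathrm{RR} \le \ell_{\mathrm{CE}}$ on random scores. All function names are hypothetical.

```python
import numpy as np

def rank_of_target(scores, target):
    # Rank r_+ = 1 + number of items scored strictly higher than the target.
    return 1 + int(np.sum(scores > scores[target]))

def full_ce(scores, target):
    # Cross-entropy with a full softmax: -log softmax(scores)[target],
    # computed with a numerically stable log-sum-exp.
    m = np.max(scores)
    return float(m + np.log(np.sum(np.exp(scores - m))) - scores[target])

def neg_log_ndcg(rank):
    # Assumption: one relevant item, so NDCG = 1 / log2(1 + rank).
    return float(np.log(np.log2(1.0 + rank)))

def neg_log_rr(rank):
    # Reciprocal rank RR = 1 / rank, so -log RR = log rank.
    return float(np.log(rank))

rng = np.random.default_rng(0)
checks = []
for _ in range(1000):
    scores = rng.normal(size=50)      # random item scores (illustrative)
    target = int(rng.integers(50))    # random target item
    r = rank_of_target(scores, target)
    ce = full_ce(scores, target)
    checks.append(
        neg_log_ndcg(r) <= neg_log_rr(r) + 1e-12
        and neg_log_rr(r) <= ce + 1e-12
    )
all_hold = all(checks)
```

The second inequality holds because every item ranked above the target contributes more than 1 to $\sum_v \exp(s_v - s_+)$, so that sum is at least $r_+$; the first follows from $\log_2(1 + r) \le r$ for $r \ge 1$.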

Abstract

Large language models (LLMs) have gained much attention in the recommendation community; some studies have observed that LLMs fine-tuned with the cross-entropy loss over a full softmax can already achieve state-of-the-art performance. However, these claims are drawn from comparisons that are neither objective nor fair. Because the number of items in practice is very large, conventional recommenders typically adopt a pointwise or pairwise loss function for training instead. This substitution, however, causes severe performance degradation, leading to an under-estimation of conventional methods and over-confidence in the ranking capability of LLMs. In this work, we theoretically justify the superiority of cross-entropy and show that it can be adequately replaced by some elementary approximations with certain necessary modifications. The remarkable results across three public datasets corroborate that, even in a practical sense, existing LLM-based methods are not as effective as claimed for next-item recommendation. We hope that these theoretical understandings, in conjunction with the empirical results, will facilitate an objective evaluation of LLM-based recommendation in the future.
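To make the idea of replacing full-softmax CE with a sampled approximation concrete, here is a minimal sketch of a scaled sampled loss. The exact SCE definition is not given in this summary, so the form used here is an assumption inferred from the section title "Scaling Up the Sampled Normalizing Term": $-\log \frac{\exp(s_+)}{\exp(s_+) + \alpha \sum_{k=1}^{K} \exp(s_k)}$ over $K$ uniformly sampled negatives. The function names and the values of `num_items`, `K`, and `alpha` are illustrative only.

```python
import numpy as np

def full_ce(scores, target):
    # Cross-entropy with a full softmax over all items.
    m = np.max(scores)
    return float(m + np.log(np.sum(np.exp(scores - m))) - scores[target])

def scaled_ce(scores, target, negatives, alpha):
    # Assumed SCE form: the normalizing term uses only the sampled
    # negatives, and their contribution is scaled up by alpha.
    s_pos = scores[target]
    s_neg = scores[negatives]
    m = max(float(s_pos), float(np.max(s_neg)))  # for numerical stability
    denom = np.exp(s_pos - m) + alpha * np.sum(np.exp(s_neg - m))
    return float(m + np.log(denom) - s_pos)

rng = np.random.default_rng(0)
num_items, K, alpha = 1000, 100, 100.0   # illustrative values
scores = rng.normal(size=num_items)
target = 3

# Sample K negatives uniformly from all non-target items.
candidates = np.delete(np.arange(num_items), target)
negatives = rng.choice(candidates, size=K, replace=False)

ce = full_ce(scores, target)
sce = scaled_ce(scores, target, negatives, alpha)

# Sanity check: using every negative with alpha = 1 recovers full CE.
sce_all = scaled_ce(scores, target, candidates, 1.0)
```

Under this assumed form the per-step cost depends on $K$ rather than on the catalogue size, and increasing `alpha` strictly increases the loss for a fixed set of negatives.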

Paper Structure (22 sections, 13 theorems, 43 equations, 9 figures, 8 tables)

Introduction
Related Work
Preliminaries
The Role of Cross-Entropy Loss in Optimizing Ranking Capability
Cross-Entropy for Some Ranking Metrics
Revisiting Noise Contrastive Estimation
Scaling Up the Sampled Normalizing Term
Computational Complexity
Experiments
Experimental Setup
Overall Performance Evaluation
Other Factors for Objective Evaluation
Conclusion
Experimental setup
Overview of Loss Function
...and 7 more sections

Key Result

proposition 1

For a target item $v_+$ which is ranked as $r_+$, the following inequality holds true for any $n \ge r_+$ where

Figures (9)

Figure 1: Recommendation performance comparisons. The marker size depicts the number of model parameters: 60M for P5 (CID + IID) (Hua et al., 2023), 7B for LlamaRec (Yue et al., 2023) and E4SRec (Li et al., 2023), and merely $\le 1$M for SASRec (Kang and McAuley, 2018).
Figure 2: Performance comparison based on tighter bounds for NDCG. The dashed line represents the results trained by CE (namely the case of $\eta \rightarrow +\infty$).
Figure 3: NDCG@10 performance of NCE and NEG across different numbers of negative samples.
Figure 4: NDCG@10 performance under various weights $\alpha$.
Figure 5: Relative gaps between SCE and NCE.
...and 4 more figures

Theorems & Definitions (14)

proposition 1
corollary 1 (Bruch et al., 2019)
theorem 1
theorem 2
proposition 2
lemma 1
lemma 2
lemma 3
lemma 4
lemma 5
...and 4 more
