Understanding the Role of Cross-Entropy Loss in Fairly Evaluating Large Language Model-based Recommendation

Cong Xu, Zhangchi Zhu, Jun Wang, Jianyong Wang, Wei Zhang

TL;DR

This paper examines whether LLM-based recommenders have been evaluated fairly, and shows that cross-entropy (CE) with a full softmax yields a bound on ranking metrics such as NDCG and RR, which explains why optimizing CE improves ranking. It then develops practical CE-like approximations, Noise Contrastive Estimation (NCE) and Scaled Cross-Entropy (SCE), with theoretical bounds that depend on the current target rank $r_+$, enabling effective training with a small sampled set of negatives. Empirically, conventional models trained with CE outperform recent LLM-based methods on next-item recommendation across Beauty, MovieLens-1M, and Yelp, while NCE and SCE achieve comparable performance with far fewer negative samples, indicating that claims about LLMs are over-optimistic when fair baselines are used. The work provides objective evaluation guidelines and practical training choices (e.g., $c \approx 10$, $\alpha \approx 100$, modest $K$) for approximating CE, facilitating fair comparisons and a better understanding of the true ranking capabilities of LLM-based recommender systems.
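
The link between CE and ranking metrics can be checked numerically. The sketch below is not the paper's derivation; it assumes a single target item per query, so that RR $= 1/r_+$ and NDCG $= 1/\log_2(1 + r_+)$, and verifies the elementary chain $-\log \mathrm{NDCG} \le -\log \mathrm{RR} \le \ell_{\mathrm{CE}}$ on random scores. The function names and the random score model are illustrative assumptions.

```python
import numpy as np

def rank_of_target(scores, target):
    # 1-based rank: one plus the number of items scored strictly higher.
    return 1 + int(np.sum(scores > scores[target]))

def full_softmax_ce(scores, target):
    # CE = logsumexp(scores) - scores[target], computed stably.
    m = np.max(scores)
    return float(m + np.log(np.sum(np.exp(scores - m))) - scores[target])

def neg_log_metrics(rank):
    # Single-target case (an assumption of this sketch).
    neg_log_rr = np.log(rank)                   # -log(1 / r)
    neg_log_ndcg = np.log(np.log2(1.0 + rank))  # -log(1 / log2(1 + r))
    return float(neg_log_ndcg), float(neg_log_rr)

rng = np.random.default_rng(0)
checks = []
for _ in range(1000):
    scores = rng.normal(size=50)      # hypothetical item scores
    target = int(rng.integers(50))    # hypothetical target index
    r = rank_of_target(scores, target)
    neg_log_ndcg, neg_log_rr = neg_log_metrics(r)
    ce = full_softmax_ce(scores, target)
    # Each item ranked above the target adds at least 1 to the softmax
    # denominator (relative to the target), so CE >= log(r) = -log RR.
    checks.append(neg_log_ndcg <= neg_log_rr + 1e-12
                  and neg_log_rr <= ce + 1e-12)

all_hold = all(checks)
```

Under these assumptions `all_hold` is true for every draw: lowering CE pushes down an upper bound on $-\log \mathrm{RR}$ and $-\log \mathrm{NDCG}$, which is the intuition behind using CE as a ranking objective.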

Abstract

Large language models (LLMs) have gained much attention in the recommendation community; some studies have observed that LLMs, fine-tuned by the cross-entropy loss with a full softmax, could achieve state-of-the-art performance already. However, these claims are drawn from unobjective and unfair comparisons. In view of the substantial quantity of items in reality, conventional recommenders typically adopt a pointwise/pairwise loss function instead for training. This substitute however causes severe performance degradation, leading to under-estimation of conventional methods and over-confidence in the ranking capability of LLMs. In this work, we theoretically justify the superiority of cross-entropy, and showcase that it can be adequately replaced by some elementary approximations with certain necessary modifications. The remarkable results across three public datasets corroborate that even in a practical sense, existing LLM-based methods are not as effective as claimed for next-item recommendation. We hope that these theoretical understandings in conjunction with the empirical results will facilitate an objective evaluation of LLM-based recommendation in the future.
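
The abstract does not spell out the approximations, so the following is only a hedged sketch of one way a sampled, scaled softmax loss could look. It assumes the loss keeps the target's term in the softmax denominator and multiplies the contribution of $K$ sampled negatives by a weight $\alpha$; the function name `scaled_sampled_ce`, the uniform sampling, and the score model are hypothetical and should not be read as the paper's exact definition of SCE.

```python
import numpy as np

def scaled_sampled_ce(scores, target, negatives, alpha=1.0):
    """-log( e^{s+} / (e^{s+} + alpha * sum_{i in negatives} e^{s_i}) ).

    Assumed form for illustration only; `negatives` is an index array.
    """
    s_pos = scores[target]
    s_neg = scores[negatives]
    m = max(s_pos, np.max(s_neg))  # shift for numerical stability
    denom = np.exp(s_pos - m) + alpha * np.sum(np.exp(s_neg - m))
    return float(m + np.log(denom) - s_pos)

rng = np.random.default_rng(0)
scores = rng.normal(size=1000)   # hypothetical scores for 1000 items
target = 0
all_neg = np.arange(1, 1000)
sampled = rng.choice(all_neg, size=100, replace=False)  # K = 100

# Using every negative with alpha = 1 recovers the full-softmax CE.
loss_full = scaled_sampled_ce(scores, target, all_neg, alpha=1.0)
# Plain sampling drops terms from the denominator, so it underestimates CE.
loss_plain = scaled_sampled_ce(scores, target, sampled, alpha=1.0)
# Scaling the sampled negatives up compensates for the dropped terms.
loss_scaled = scaled_sampled_ce(scores, target, sampled, alpha=100.0)
```

With `alpha=1` and a strict subset of negatives, the loss can never exceed the full CE; a weight above 1 raises it again, which is one reading of why a scaling factor helps a small sampled set stand in for the full softmax.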

Paper Structure (22 sections, 13 theorems, 43 equations, 9 figures, 8 tables)

Key Result

proposition 1

For a target item $v_+$ which is ranked as $r_+$, the following inequality holds true for any $n \ge r_+$ where

Figures (9)

  • Figure 1: Recommendation performance comparisons. The marker size depicts the number of model parameters: 60M for P5 (CID + IID) (Hua et al., 2023), 7B for LlamaRec (Yue et al., 2023) and E4SRec (Li et al., 2023), and merely $\le 1$M for SASRec (Kang and McAuley, 2018).
  • Figure 2: Performance comparison based on tighter bounds for NDCG. The dashed line shows the results obtained by training with CE (that is, the case $\eta \rightarrow +\infty$).
  • Figure 3: NDCG@10 performance of NCE and NEG across different numbers of negative samples.
  • Figure 4: NDCG@10 performance under various weights $\alpha$.
  • Figure 5: Relative gaps between SCE and NCE.
  • ...and 4 more figures

Theorems & Definitions (14)

  • proposition 1
  • corollary 1 (Bruch et al., 2019)
  • theorem 1
  • theorem 2
  • proposition 2
  • lemma 1
  • lemma 2
  • lemma 3
  • lemma 4
  • lemma 5
  • ...and 4 more