Are LLM-based Recommenders Already the Best? Simple Scaled Cross-entropy Unleashes the Potential of Traditional Sequential Recommenders

Cong Xu, Zhangchi Zhu, Mo Yu, Jun Wang, Jianyong Wang, Wei Zhang

TL;DR

This paper questions the presumed supremacy of LLM-based recommenders in sequential recommendation by presenting a principled analysis of the cross-entropy (CE) loss. It introduces the notions of tightness and coverage for the CE normalizing term and shows that CE with a full softmax is a bound on ranking metrics such as NDCG and MRR, while tighter, truncated variants of that bound (controlled by a parameter $\eta$) expose a trade-off between the two properties. A simple yet effective loss, Scaled Cross-Entropy (SCE), is proposed to approximate the full softmax with far fewer negatives by scaling up the sampled normalizing term, enabling traditional sequential models to achieve competitive or superior ranking performance compared to LLM-based methods under practical settings. Empirical results on Beauty and Yelp demonstrate that, when training losses are aligned, traditional models can surpass LLM-based recommenders, and SCE offers a practical path to close the gap with reduced computational burden. Overall, the work calls for fair, objective evaluation of LLM-based recommenders and provides actionable guidance for leveraging traditional models at scale.

Abstract

Large language models (LLMs) have been garnering increasing attention in the recommendation community. Some studies have observed that LLMs, when fine-tuned by the cross-entropy (CE) loss with a full softmax, could achieve 'state-of-the-art' performance in sequential recommendation. However, most of the baselines used for comparison are trained using a pointwise/pairwise loss function. This inconsistent experimental setting leads to the underestimation of traditional methods and further fosters over-confidence in the ranking capability of LLMs. In this study, we provide theoretical justification for the superiority of the cross-entropy loss by demonstrating its two desirable properties: tightness and coverage. Furthermore, this study sheds light on additional novel insights: 1) Taking into account only the recommendation performance, CE is not yet optimal, as it is not a particularly tight bound in terms of some ranking metrics. 2) In scenarios where a full softmax cannot be performed, an effective alternative is to scale up the sampled normalizing term. These findings then help unleash the potential of traditional recommendation models, allowing them to surpass LLM-based counterparts. Given the substantial computational burden, existing LLM-based methods are not as effective as claimed for sequential recommendation. We hope that these theoretical understandings in conjunction with the empirical results will facilitate an objective evaluation of LLM-based recommendation in the future.
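
To make the second insight concrete, the following is a minimal NumPy sketch of what "scaling up the sampled normalizing term" could look like. It assumes the loss takes the form $-\log \frac{\exp(s_+)}{\exp(s_+) + \alpha \sum_{i=1}^{K} \exp(s_i)}$ over $K$ sampled negatives; the function names, the random scores, and the choice $\alpha = 5$ are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def logsumexp(x):
    # Numerically stable log(sum(exp(x))).
    m = np.max(x)
    return m + np.log(np.sum(np.exp(x - m)))

def full_ce(scores, target):
    # Cross-entropy with a full softmax over every item.
    return logsumexp(scores) - scores[target]

def scaled_ce(scores, target, negatives, alpha=1.0):
    # Sampled cross-entropy whose negative part of the normalizing
    # term is multiplied by alpha (assumed form, see the lead-in).
    # Adding log(alpha) to a logit multiplies its exp() by alpha.
    logits = np.concatenate(([scores[target]],
                             scores[negatives] + np.log(alpha)))
    return logsumexp(logits) - scores[target]

rng = np.random.default_rng(0)
num_items, K = 1000, 100                 # hypothetical catalogue size and sample size
scores = rng.normal(size=num_items)      # stand-in for model scores
target = 0                               # index of the positive item
negatives = rng.choice(np.arange(1, num_items), size=K, replace=False)

loss_full = full_ce(scores, target)
loss_ssm = scaled_ce(scores, target, negatives, alpha=1.0)  # plain sampled softmax
loss_sce = scaled_ce(scores, target, negatives, alpha=5.0)  # scaled variant
```

With $\alpha = 1$ the sketch reduces to an ordinary sampled softmax loss, whose normalizing term can only underestimate the full one; a larger $\alpha$ pushes the sampled term back up toward the full normalizer without drawing more negatives.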

Paper Structure (18 sections, 8 theorems, 29 equations, 4 figures, 4 tables)

Key Result

Lemma 1

Minimizing the cross-entropy loss $\ell_{\mathrm{CE}}$ is equivalent to maximizing a lower bound of NDCG and MRR.
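
As a sanity check on Lemma 1, the sketch below numerically verifies the inequalities $-\log \mathrm{NDCG} \le \ell_{\mathrm{CE}}$ and $-\log \mathrm{MRR} \le \ell_{\mathrm{CE}}$ on random scores. It assumes a single target item per sequence (so $\mathrm{NDCG} = 1/\log_2(1 + r_+)$ and $\mathrm{MRR} = 1/r_+$, with $r_+$ the rank of the target) and natural logarithms; the paper's exact statement may differ in constants, and all names here are illustrative.

```python
import numpy as np

def ce_loss(scores, target):
    # Full-softmax cross-entropy, computed stably.
    m = scores.max()
    return m + np.log(np.exp(scores - m).sum()) - scores[target]

def rank_of(scores, target):
    # 1-based rank: one plus the number of items scored strictly higher.
    return 1 + int(np.sum(scores > scores[target]))

rng = np.random.default_rng(0)
gaps = []
for _ in range(1000):
    scores = rng.normal(size=50)          # hypothetical scores for 50 items
    target = int(rng.integers(50))
    r = rank_of(scores, target)
    ndcg = 1.0 / np.log2(1 + r)
    mrr = 1.0 / r
    loss = ce_loss(scores, target)
    # Each gap is loss - (-log metric); the lemma says it is non-negative.
    gaps.append((loss + np.log(ndcg), loss + np.log(mrr)))
gaps = np.array(gaps)
```

The reasoning behind the check: $r_+ = 1 + \sum_{v \ne v_+} \mathbb{1}[s_v > s_+] \le 1 + \sum_{v \ne v_+} \exp(s_v - s_+)$, and the logarithm of the right-hand side is exactly $\ell_{\mathrm{CE}}$. The NDCG case follows because $\log_2(1 + r) \le r$ for $r \ge 1$.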

Figures (4)

  • Figure 1: (a) Performance comparison based on tighter bounds for NDCG. The dashed line represents the results trained by CE (namely the case of $\eta \rightarrow +\infty$). (b) Tightness and Coverage illustration of different loss functions. $\eta^*$ is a task-specific value derived from (a).
  • Figure 2: Bounding probabilities of SCE at different values of $\alpha = 1, 5, 100$ across varying ranks $r_+$ and number of negative items $K$. The probabilities are calculated based on Eq. \ref{eq-ssm-bound-prob} and Eq. \ref{eq-sce-bounds}, with negative values being clipped to zero. For the Beauty dataset, $|\mathcal{I}| = 12101$.
  • Figure 3: Sampled softmax loss comparison on Beauty. Left: $\ell_{\text{SSM}}$ versus $\ell_{\text{SCE}} \: (\alpha=100)$ across different numbers of negative items. Right: $\ell_{\text{SCE}} \: (K=100)$ with an increasing value of $\alpha$.
  • Figure 4: Performance comparisons between state-of-the-art LLM-based recommenders and traditional models using $\ell_{\text{CE}}$ and $\ell_{\text{SCE}} \:(\alpha=100)$. Existing LLM-based recommenders are not as effective as claimed even in a practical scenario.

Theorems & Definitions (15)

  • Lemma 1 (Bruch et al., 2019)
  • Proposition 1
  • proof
  • Theorem 1
  • proof
  • Theorem 2
  • proof
  • Proposition 2
  • proof
  • Lemma 2
  • ...and 5 more