Scalable Cross-Entropy Loss for Sequential Recommendations with Large Item Catalogs

Gleb Mezentsev; Danil Gusak; Ivan Oseledets; Evgeny Frolov

TL;DR

The paper addresses the scalability of Cross-Entropy loss in sequential recommendations with large item catalogs by introducing Scalable Cross-Entropy (SCE), which approximates CE through GPU-friendly, bucketed computations that focus on hard negatives. SCE uses random bucket centers and top-k selections to form a reduced set of logits, coupled with a Mix operation to mitigate bucket collapse, enabling memory-efficient training of SASRec without sacrificing accuracy. Empirical results show substantial memory reductions (up to 100x) and faster training (up to 6.7x) while achieving or surpassing the performance of strong baselines across five diverse datasets, and competitive results against recent models on Amazon Beauty. The approach has broad applicability beyond recommender systems, potentially benefiting large-vocabulary NLP models and other domains with large output spaces.
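The bucketed selection described above can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation. The function names, the standard-normal bucket centers, and the choice to pool candidate items across buckets are all assumptions made for illustration, and the Mix operation is omitted entirely. The sketch only shows the general idea: random centers pick out model outputs and catalog items that project strongly onto the same direction, and the loss is computed over that reduced set of logits plus the correct item.

```python
import math
import random


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def top_k(scores, k):
    """Return the indices of the k largest scores."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]


def log_sum_exp(values):
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def full_ce(outputs, targets, items):
    """Reference: mean cross-entropy over the whole catalog."""
    losses = []
    for x, t in zip(outputs, targets):
        logits = [dot(x, y) for y in items]
        losses.append(log_sum_exp(logits) - logits[t])
    return sum(losses) / len(losses)


def bucketed_ce(outputs, targets, items, n_buckets, k_out, k_items, rng):
    """Hypothetical reduced-logit CE (illustrative, not the paper's algorithm).

    outputs: model output vectors, one per prediction position
    targets: index of the correct item for each output
    items:   item embedding vectors (the catalog)
    """
    dim = len(items[0])
    # Assumption: bucket centers are drawn from a standard normal distribution.
    centers = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_buckets)]

    # candidates[i] holds the item indices scored for output i.
    candidates = {}
    for c in centers:
        out_ids = top_k([dot(x, c) for x in outputs], k_out)
        item_ids = top_k([dot(y, c) for y in items], k_items)
        for i in out_ids:
            # Assumption: candidates from different buckets are pooled.
            candidates.setdefault(i, set()).update(item_ids)

    losses = []
    for i, cand in candidates.items():
        t = targets[i]
        pos = dot(outputs[i], items[t])
        # The correct item is always scored; selected items act as negatives.
        logits = [pos] + [dot(outputs[i], items[j]) for j in cand if j != t]
        losses.append(log_sum_exp(logits) - pos)
    return sum(losses) / len(losses), candidates
```

When every output and every item is selected (`k_out` equal to the number of outputs and `k_items` equal to the catalog size), the reduced loss coincides with `full_ce`, which is a convenient sanity check. With small `k_out` and `k_items`, only a fraction of the logit matrix is ever formed.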

Abstract

Scalability plays a crucial role in productionizing modern recommender systems. Even lightweight architectures may suffer from high computational overload due to intermediate calculations, limiting their practicality in real-world applications. Specifically, applying full Cross-Entropy (CE) loss often yields state-of-the-art recommendation quality, but it suffers from excessive GPU memory utilization when dealing with large item catalogs. This paper introduces a novel Scalable Cross-Entropy (SCE) loss function in the sequential learning setup. It approximates the CE loss for datasets with large catalogs, improving both time efficiency and memory usage without compromising recommendation quality. Unlike traditional negative sampling methods, our approach uses a selective GPU-efficient computation strategy that focuses on the most informative elements of the catalog, particularly those most likely to be false positives. This is achieved by approximating the softmax distribution over a subset of the model outputs through maximum inner product search. Experimental results on multiple datasets demonstrate that SCE reduces peak memory usage by a factor of up to 100 compared to the alternatives while matching or even exceeding their metric values. The proposed approach also opens new perspectives for large-scale developments in other domains, such as large language models.
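To see why full CE strains GPU memory on large catalogs, a back-of-the-envelope calculation of the logit tensor size helps. The sketch below is only arithmetic under stated assumptions: the batch size of 64 and the 256 negatives come from the Figure 6 caption, while the sequence length of 200, the catalog of 1,000,000 items, and 4-byte floats are hypothetical values chosen for illustration. It counts the logit tensor alone, not gradients or optimizer state, so it does not reproduce the measured peak-memory figures reported in the paper.

```python
def logits_bytes(batch_size, seq_len, n_scored_items, bytes_per_value=4):
    """Size of a dense logit tensor of shape (batch, seq_len, n_scored_items)."""
    return batch_size * seq_len * n_scored_items * bytes_per_value


# Full CE scores every catalog item at every sequence position.
full = logits_bytes(batch_size=64, seq_len=200, n_scored_items=1_000_000)

# A sampled loss scores one positive plus 256 negatives per position.
reduced = logits_bytes(batch_size=64, seq_len=200, n_scored_items=1 + 256)

print(f"full CE logits: {full / 1e9:.1f} GB")
print(f"reduced logits: {reduced / 1e6:.1f} MB")
```

Under these assumptions the full logit tensor is tens of gigabytes while the reduced one is on the order of megabytes, which is why restricting the set of scored items matters far more than shrinking the model itself.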

Paper Structure (20 sections, 6 equations, 6 figures, 4 tables, 1 algorithm)

Introduction
Related Work
Transformer-Based Sequential Recommenders
Approaches to Negative Sampling and Cross-Entropy Approximation
Proposed approach
Scalable Cross-Entropy
Bucket Collapse Mitigation
Method Applicability
Experiments
Experimental Settings
Datasets
Evaluation
Model and Baselines
Results
Dependence on SCE Hyperparameters
...and 5 more sections

Figures (6)

Figure 1: Impact of different components on peak GPU memory when training SASRec with Cross-Entropy loss. The measurements are performed using PyTorch's memory profiling tools.
Figure 2: Effect of $\alpha$ and $\beta$ on SASRec-SCE performance on the Kindle Store dataset. Each curve is a Pareto front for different values of $s$ and $b_y$. Curves corresponding to an $(\alpha, \beta)$ pair are solid purple lines, curves corresponding to the same $\alpha$ are dotted red lines, and curves corresponding to the same $\beta$ are dash-dotted blue lines. Dashed lines indicate that no configurations yield a higher NDCG@10 for a larger GPU memory budget.
Figure 3:
Figure 4:
Figure 6: Peak GPU memory utilization during the training stage for different catalog sizes. Models are trained with a batch size of 64. Models with SCE/BCE$^+$/gBCE/CE$^{-}$ as a loss are trained with 256 negatives.
...and 1 more figures
