Evaluating Performance and Bias of Negative Sampling in Large-Scale Sequential Recommendation Models

Arushi Prakash, Dimitrios Bermperidis, Srivas Chennu

TL;DR

Random negative sampling, though commonly used, reinforces popularity bias and performs best for head items. In-batch and global popularity negative sampling can offer balanced performance across popularity bands, at the cost of lower overall model performance.

Abstract

Large-scale industrial recommendation models predict the most relevant items from catalogs containing millions or billions of options. To train these models efficiently, a small set of irrelevant items (negative samples) is selected from the vast catalog for each relevant item (positive example), helping the model distinguish between relevant and irrelevant items. Choosing the right negative sampling method is a common challenge. We address this by implementing and comparing various negative sampling methods - random, popularity-based, in-batch, mixed, adaptive, and adaptive with mixed variants - on modern sequential recommendation models. Our experiments, including hyperparameter optimization and 20x repeats on three benchmark datasets with varying popularity biases, show how the choice of method and dataset characteristics affect key model performance metrics. We also reveal that average performance metrics often hide imbalances across popularity bands (head, mid, tail). We find that commonly used random negative sampling reinforces popularity bias and performs best for head items. Popularity-based methods (in-batch and global popularity negative sampling) can offer balanced performance at the cost of lower overall model performance. Our study serves as a practical guide to the trade-offs in selecting a negative sampling method for large-scale sequential recommendation models. Code, datasets, experimental results, and hyperparameters are available at: https://github.com/apple/ml-negative-sampling.
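
To make the difference between three of the sampling methods named above concrete, here is a minimal NumPy sketch. It is an illustration under stated assumptions, not the code from the linked repository: the function names, the integer item ids in `0..num_items-1`, and the `(B, S, N)` output shape (batch size, sequence length, number of negatives) are all hypothetical choices. For brevity, none of the samplers removes negatives that happen to coincide with the positive item.

```python
import numpy as np


def sample_uniform(rng, num_items, shape):
    """Random negatives: every catalog item is equally likely to be drawn."""
    return rng.integers(0, num_items, size=shape)


def sample_by_popularity(rng, item_counts, shape):
    """Global popularity negatives: items are drawn in proportion to
    how often they appear in the training interactions."""
    probs = item_counts / item_counts.sum()
    return rng.choice(len(item_counts), size=shape, p=probs)


def sample_in_batch(rng, positives, num_negatives):
    """In-batch negatives: reuse the positives of the current batch as the
    candidate pool, so popular items are drawn more often implicitly."""
    batch_size, seq_len = positives.shape
    pool = positives.reshape(-1)
    # Assumption: collisions with the positive at the same position are kept.
    return rng.choice(pool, size=(batch_size, seq_len, num_negatives))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    item_counts = np.array([100, 10, 1, 1])    # toy interaction counts
    positives = np.array([[0, 1, 2], [3, 0, 1]])  # shape (B=2, S=3)
    print(sample_uniform(rng, 4, (2, 3, 5)).shape)
    print(sample_by_popularity(rng, item_counts, (2, 3, 5)).shape)
    print(sample_in_batch(rng, positives, 5).shape)
```

Uniform sampling rarely picks any particular item as a negative in a large catalog, whereas the two popularity-based samplers draw frequently seen items as negatives more often, which is one intuition for the trade-off described in the abstract.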

Paper Structure

This paper contains 12 sections, 10 equations, 6 figures, 3 tables.

Figures (6)

  • Figure 1: (a) Structure of the self-attention sequential recommendation (SASRec) model. (b) Structure of positive and negative sample tensors, based on different negative sampling methods, where $B$ is the batch size, $S$ is the sequence length, $N$ is the number of negatives, and $E$ is the dimension of the item embedding. (c) Global temporal data splitting applied on benchmark datasets to prevent information leakage.
  • Figure 2: Popularity-based cohorts in the RetailRocket dataset
  • Figure 3: Histogram of normalized popularity distributions of the public benchmark datasets, MovieLens 10M, Amazon Video and RetailRocket.
  • Figure 4: Average NDCG@10 for all datasets: (left) MovieLens 10M, (center) Amazon Beauty, and (right) RetailRocket, with 20 runs for each point
  • Figure 5: NDCG@10 on validation data for the ML-10M dataset as a function of learning rate.
  • ...and 1 more figures
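
The captions above refer to popularity-based cohorts and to NDCG@10. The sketch below shows one way such a per-cohort breakdown could be computed. It assumes a next-item evaluation with a single relevant item per test case, in which case NDCG@k reduces to $1/\log_2(r + 2)$ for a 0-indexed rank $r < k$ and 0 otherwise. The cohort cut-offs (`head_frac`, `tail_frac`) and all function names are hypothetical and are not taken from the paper.

```python
import numpy as np


def ndcg_at_k(rank, k=10):
    """NDCG@k for one relevant item found at 0-indexed position `rank`."""
    return 1.0 / np.log2(rank + 2) if rank < k else 0.0


def popularity_band(item_counts, head_frac=0.2, tail_frac=0.2):
    """Label each item 'head', 'mid', or 'tail' by popularity rank.
    The 20% cut-offs are illustrative assumptions only."""
    n = len(item_counts)
    order = np.argsort(-item_counts)  # most popular item first
    n_head = int(round(head_frac * n))
    n_tail = int(round(tail_frac * n))
    bands = np.full(n, "mid", dtype=object)
    bands[order[:n_head]] = "head"
    bands[order[n - n_tail:]] = "tail"
    return bands


def ndcg_by_band(target_items, ranks, bands, k=10):
    """Average NDCG@k separately for each popularity band of the target item."""
    scores = {}
    for band in ("head", "mid", "tail"):
        vals = [ndcg_at_k(r, k)
                for item, r in zip(target_items, ranks)
                if bands[item] == band]
        if vals:  # skip bands with no test cases
            scores[band] = float(np.mean(vals))
    return scores
```

Reporting the three band averages next to the overall mean is what exposes the imbalance the abstract describes: a model can score well on average while scoring close to zero on tail items.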