Diversity-Augmented Negative Sampling for Implicit Collaborative Filtering

Yueqing Xuan, Kacper Sokol, Mark Sanderson, Jeffrey Chan

TL;DR

This paper tackles the problem of ineffective negative sampling in implicit collaborative filtering due to sampling from dense regions of the item space. It introduces Diverse Negative Sampling (DivNS), a three-component framework that (i) builds user-specific caches of informative negatives, (ii) applies diversity-augmented $k$-DPP sampling to select a diverse subset from the cache while penalizing similarity to hard negatives, and (iii) synthesizes negatives by mixing hard and diverse negatives. Theoretical and empirical analyses show traditional sampling overlooks diversity, leading to biased training; DivNS improves generalisation and ranking quality across four diverse datasets, with modest computational overhead. The approach is validated across MF and LightGCN with consistent gains, and ablation studies confirm the critical roles of caching, diversity-driven sampling, and synthetic negatives. Overall, DivNS offers a practical, metadata-free, scalable method to enhance negative sampling by promoting informative and diverse training signals in implicit CF.

Abstract

Recommenders built upon implicit collaborative filtering are typically trained to distinguish between users' positive and negative preferences. When direct observations of the latter are unavailable, negative training data are constructed with sampling techniques. But since items often exhibit clustering in the latent space, existing methods tend to oversample negatives from dense regions, resulting in homogeneous training data and limited model expressiveness. To address these shortcomings, we propose a novel negative sampler with diversity guarantees. To achieve them, our approach first pairs each positive item of a user with one that they have not yet interacted with; this instance, called a hard negative, is chosen as the top-scoring item according to the model. Instead of discarding the remaining highly informative items, we store them in a user-specific cache. Next, our diversity-augmented sampler selects a representative subset of negatives from the cache, ensuring its dissimilarity from the corresponding user's hard negatives. Our generator then combines these items with the hard negatives, replacing the latter with more effective (synthetic) negative training data that are both informative and diverse. Experiments show that our method consistently leads to superior recommendation quality without sacrificing computational efficiency.
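The three steps described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the dot-product scoring, the candidate and cache sizes, and the mixing weight `lam` are all assumptions made for illustration, and a simple greedy least-similar selection stands in for the diversity-augmented $k$-DPP sampler.

```python
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    na, nb = dot(a, a) ** 0.5, dot(b, b) ** 0.5
    return dot(a, b) / (na * nb) if na and nb else 0.0

def hard_negative_and_cache(user_vec, item_vecs, interacted, n_candidates, cache_size, rng):
    """Draw candidates uniformly from non-interacted items and rank them by
    model score (assumed here to be a dot product). The top item is the hard
    negative; the next-best ones are kept in the user's cache."""
    pool = [i for i in range(len(item_vecs)) if i not in interacted]
    candidates = rng.sample(pool, min(n_candidates, len(pool)))
    ranked = sorted(candidates, key=lambda i: dot(user_vec, item_vecs[i]), reverse=True)
    return ranked[0], ranked[1:1 + cache_size]

def pick_diverse(cache, hard, item_vecs, k):
    """Greedy stand-in for the k-DPP step (an assumption, not the paper's
    sampler): repeatedly take the cached item whose largest cosine similarity
    to the hard negative and to the items already chosen is smallest."""
    chosen, remaining = [], list(cache)
    while remaining and len(chosen) < k:
        refs = [hard] + chosen
        best = min(remaining,
                   key=lambda i: max(cosine(item_vecs[i], item_vecs[r]) for r in refs))
        chosen.append(best)
        remaining.remove(best)
    return chosen

def synthesize(hard_vec, diverse_vec, lam):
    """Linear mix of two embeddings; the exact mixing rule is assumed."""
    return [lam * h + (1 - lam) * d for h, d in zip(hard_vec, diverse_vec)]

# Toy usage with random 4-dimensional embeddings (hypothetical data).
rng = random.Random(0)
items = [[rng.gauss(0, 1) for _ in range(4)] for _ in range(50)]
user = [rng.gauss(0, 1) for _ in range(4)]
hard, cache = hard_negative_and_cache(user, items, {0, 1, 2}, 10, 5, rng)
diverse = pick_diverse(cache, hard, items, 2)
synthetic = synthesize(items[hard], items[diverse[0]], 0.5)
```

In this sketch the synthetic vector replaces the hard negative's embedding in the training loss; how the paper weights or schedules that replacement is not specified in the text above.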

Paper Structure

This paper contains 42 sections, 1 theorem, 10 equations, 7 figures, 6 tables, 1 algorithm.

Key Result

Proposition 1

Two-stage negative sampling (NS) methods, i.e., sampling followed by ranking, that employ uniform sampling as their foundation cannot guarantee maximum item diversity measured in the latent space.
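The intuition behind this result can be seen in a toy experiment; this is an illustration, not the paper's proof. The sketch below assumes a hypothetical 2-D item space with one dense cluster and two sparse ones, uses the number of distinct clusters covered as a crude proxy for diversity, and compares uniform sampling against greedy farthest-point selection (a simple diversity-seeking baseline swapped in here, not $k$-DPP).

```python
import math
import random

def make_items(rng):
    """One dense cluster (90 items) and two sparse ones (5 each) in 2-D."""
    spec = [((0.0, 0.0), 90), ((10.0, 0.0), 5), ((0.0, 10.0), 5)]
    points, labels = [], []
    for label, (centre, n) in enumerate(spec):
        for _ in range(n):
            points.append((centre[0] + rng.gauss(0, 0.3),
                           centre[1] + rng.gauss(0, 0.3)))
            labels.append(label)
    return points, labels

def farthest_point(points, k):
    """Start from item 0, then repeatedly add the item whose distance to the
    nearest already-chosen item is largest."""
    chosen = [0]
    while len(chosen) < k:
        nxt = max((i for i in range(len(points)) if i not in chosen),
                  key=lambda i: min(math.dist(points[i], points[c]) for c in chosen))
        chosen.append(nxt)
    return chosen

def coverage(indices, labels):
    """Number of distinct clusters hit by the selected items."""
    return len({labels[i] for i in indices})

rng = random.Random(0)
points, labels = make_items(rng)
k = 3
trials = 1000
# Average cluster coverage of uniform samples of size k.
uniform_cov = sum(coverage(rng.sample(range(len(points)), k), labels)
                  for _ in range(trials)) / trials
# Cluster coverage of the diversity-seeking selection.
diverse_cov = coverage(farthest_point(points, k), labels)
print(uniform_cov, diverse_cov)
```

Because 90% of the items sit in one cluster, uniform draws mostly land there, whereas the distance-aware selection reaches every cluster. Ranking the uniform candidates afterwards cannot recover regions that were never sampled.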

Figures (7)

  • Figure 1: Toy example showing the distribution of $k=15$ negative items ($\star$) drawn through (a) uniform and (b) $k$-Determinantal Point Process ($k$-DPP) sampling. Item clusters in the embedding space are marked with different colours.
  • Figure 2: Illustration of DivNS. Epoch 0 is used for initialisation; subsequent epochs (Epoch 1, …, e) are identical.
  • Figure 3: Diversity of sampled negatives for the top four performing samplers deployed with LightGCN.
  • Figure 4: Impact of synthetic negatives generation on performance (NDCG@20 and Recall@20) of LightGCN for (a) Pinterest and (b) Yelp 2022. The paper's appendix lists full results.
  • Figure 5: Visualisation of item embeddings for two benchmark datasets, (a) Pinterest and (b) Yelp 2022, using t-SNE. A clear cluster structure can be observed in both cases.
  • ...and 2 more figures

Theorems & Definitions (1)

  • Proposition 1