Diversity-Augmented Negative Sampling for Implicit Collaborative Filtering
Yueqing Xuan, Kacper Sokol, Mark Sanderson, Jeffrey Chan
TL;DR
This paper tackles ineffective negative sampling in implicit collaborative filtering (CF), which arises when negatives are drawn disproportionately from dense regions of the item space. It introduces Diverse Negative Sampling (DivNS), a three-component framework that (i) builds user-specific caches of informative negatives, (ii) applies diversity-augmented $k$-DPP sampling to select a diverse subset from the cache while penalizing similarity to hard negatives, and (iii) synthesizes negatives by mixing hard and diverse negatives. Theoretical and empirical analyses show that traditional sampling overlooks diversity, which leads to biased training; DivNS improves generalization and ranking quality across four datasets with modest computational overhead. The approach is validated with both matrix factorization (MF) and LightGCN, showing consistent gains, and ablation studies confirm the critical roles of caching, diversity-driven sampling, and synthetic negatives. Overall, DivNS offers a practical, metadata-free, scalable method that enhances negative sampling by promoting informative and diverse training signals in implicit CF.
Abstract
Recommenders built upon implicit collaborative filtering are typically trained to distinguish between users' positive and negative preferences. When direct observations of the latter are unavailable, negative training data are constructed with sampling techniques. But since items often cluster in the latent space, existing methods tend to oversample negatives from dense regions, resulting in homogeneous training data and limited model expressiveness. To address these shortcomings, we propose a novel negative sampler with diversity guarantees. Our approach first pairs each positive item of a user with an item that they have not yet interacted with; this instance, called a hard negative, is chosen as the top-scoring item according to the model. Instead of discarding the remaining highly informative items, we store them in a user-specific cache. Next, our diversity-augmented sampler selects a representative subset of negatives from the cache, ensuring that it is dissimilar from the corresponding user's hard negatives. Our generator then combines these items with the hard negatives, replacing the latter with synthetic negative training data that are both informative and diverse. Experiments show that our method consistently leads to superior recommendation quality without sacrificing computational efficiency.
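The pipeline described above (hard negative, cache, diverse subset, synthetic negative) can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `divns_sketch`, the dot-product scoring, the mixing weight `lam`, and the averaging of the diverse items are all assumptions made for illustration. In particular, a greedy farthest-point rule on cosine similarity is used here as a simple stand-in for $k$-DPP sampling, and all embeddings are assumed to be non-zero.

```python
import numpy as np


def cosine_sim(a, b):
    """Cosine similarity between rows of a (n, d) and rows of b (m, d)."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a @ b.T


def divns_sketch(user_emb, item_embs, candidate_ids, k=2, lam=0.5):
    """Hypothetical helper: returns (hard_id, diverse_ids, synthetic_embedding)."""
    candidate_ids = np.asarray(candidate_ids)

    # Score uninteracted candidates with the current model (dot product assumed).
    scores = item_embs[candidate_ids] @ user_emb
    order = np.argsort(-scores, kind="stable")
    hard_id = int(candidate_ids[order[0]])      # top-scoring item = hard negative
    remaining = [int(i) for i in candidate_ids[order[1:]]]  # the rest form the cache

    # Greedy stand-in for k-DPP: repeatedly pick the cached item whose largest
    # similarity to the hard negative and to already chosen items is smallest.
    chosen, reference = [], [hard_id]
    for _ in range(min(k, len(remaining))):
        sims = cosine_sim(item_embs[remaining], item_embs[reference])
        pick = remaining.pop(int(np.argmin(sims.max(axis=1))))
        chosen.append(pick)
        reference.append(pick)

    # Assumed mixing rule: convex combination of the hard negative and the
    # mean of the diverse negatives.
    if chosen:
        diverse_mean = item_embs[chosen].mean(axis=0)
        synthetic = lam * item_embs[hard_id] + (1.0 - lam) * diverse_mean
    else:
        synthetic = item_embs[hard_id].copy()
    return hard_id, chosen, synthetic


# Toy example: items 0-2 form a dense cluster near the user, items 3-4 lie elsewhere.
item_embs = np.array([[1.0, 0.0], [0.9, 0.1], [0.8, 0.2], [0.0, 1.0], [-1.0, 0.0]])
user_emb = np.array([1.0, 0.0])
hard_id, diverse_ids, synthetic = divns_sketch(user_emb, item_embs, [0, 1, 2, 3, 4])
```

In this toy setting, the diverse step skips items 1 and 2, which sit in the same dense cluster as the hard negative, and picks items from other regions instead; this is the behavior the diversity-augmented sampler is meant to encourage.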
