Understanding Embedding Scaling in Collaborative Filtering

Yicheng He, Zhou Kaiyu, Haoyue Bai, Fengbin Zhu, Yonghui Yang

TL;DR

The paper examines why scaling embeddings in collaborative filtering (CF) does not universally improve performance. It identifies two new scaling regimes, double-peak and logarithmic, that emerge as the embedding size $k$ increases across 10 datasets and 4 models. Through large-scale experiments and theoretical analysis, it links these phenomena to interaction noise and shows that noise-robust architectures, especially SGL with contrastive learning, scale more stably. It also proposes a simple denoising approach, BPR_Drop, to mitigate noise effects in traditional models such as BPR. The work highlights the role of data quality and architectural robustness in enabling scalable embeddings and motivates future exploration of noise-filtering and Transformer-inspired ideas for CF.

Abstract

Scaling recommendation models into large recommendation models has become one of the most widely discussed topics. Recent efforts focus on components other than the embedding dimension, because scaling embeddings is believed to degrade performance. Although there have been some initial observations on embeddings, the root cause of their non-scalability remains unclear. Moreover, whether performance degradation occurs across different types of models and datasets is still unexplored. To study the effect of embedding dimension on performance, we conduct large-scale experiments across 10 datasets with varying sparsity levels and scales, using 4 representative classical architectures. Surprisingly, we observe two novel phenomena: double-peak and logarithmic. In the former, as the embedding dimension increases, performance first improves, then declines, rises again, and eventually drops. In the latter, performance follows a perfect logarithmic curve. Our contributions are threefold. First, we discover two novel phenomena when scaling collaborative filtering models. Second, we explain the underlying causes of the double-peak phenomenon. Lastly, we theoretically analyze the noise robustness of collaborative filtering models, with results matching empirical observations.
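To make the quantity being scaled concrete, here is a minimal sketch of a doubling grid of embedding dimensions together with the standard BPR pairwise loss evaluated on vectors of each size. The helper names `embedding_dims` and `bpr_loss`, the range 8 to 64, and the random untrained vectors are assumptions for illustration; this excerpt does not state the paper's exact grid or training setup.

```python
import math
import random


def embedding_dims(start, stop):
    """Return start, 2*start, 4*start, ... without exceeding stop."""
    if start < 1:
        raise ValueError("start must be a positive integer")
    dims = []
    k = start
    while k <= stop:
        dims.append(k)
        k *= 2
    return dims


def bpr_loss(user, pos_item, neg_item):
    """Standard BPR pairwise loss: -log(sigmoid(u . i_pos - u . i_neg))."""
    pos_score = sum(u * v for u, v in zip(user, pos_item))
    neg_score = sum(u * v for u, v in zip(user, neg_item))
    diff = pos_score - neg_score
    # -log(sigmoid(diff)) == log(1 + exp(-diff)); this form avoids overflow.
    return max(-diff, 0.0) + math.log1p(math.exp(-abs(diff)))


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical range; the excerpt only says the dimension is doubled.
    for k in embedding_dims(8, 64):
        # Untrained random vectors: only the role of k is illustrated here.
        u, i, j = ([rng.gauss(0.0, 0.1) for _ in range(k)] for _ in range(3))
        print(k, round(bpr_loss(u, i, j), 4))
```

In a real sweep, each value of $k$ would correspond to a separately trained model whose ranking metric is then plotted against $k$.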

Paper Structure

This paper contains 31 sections, 7 theorems, 46 equations, 7 figures, 3 tables.

Key Result

Theorem 1

Let $\Theta_0$ be the minimizer of the clean empirical loss $\mathbb{E}_{x \sim \mathcal{D}_0}[\ell(\Theta; x)]$, and consider a noisy interaction distribution $\mathcal{D}_\delta := (1 - \delta) \mathcal{D}_0 + \delta \mathcal{N}$, where $\mathcal{N}$ is a noise distribution and $\delta \in [0,1]$ where $\Theta_\delta$ is the minimizer under $\mathcal{D}_\delta$, and $H := \nabla^2_\Theta \mathb
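The mixture $\mathcal{D}_\delta := (1 - \delta) \mathcal{D}_0 + \delta \mathcal{N}$ in the theorem statement can be read as a sampling procedure: with probability $\delta$ an interaction comes from the noise distribution, otherwise from the clean one. The sketch below illustrates only that definition. Representing $\mathcal{D}_0$ as a list of observed (user, item) pairs and $\mathcal{N}$ as uniformly random pairs is an assumption made here; the excerpt does not specify the form of $\mathcal{N}$, and `sample_interaction` is a hypothetical helper.

```python
import random


def sample_interaction(clean_pairs, n_users, n_items, delta, rng):
    """Draw one (user, item) pair from (1 - delta) * D_0 + delta * N.

    D_0 is approximated by uniform sampling from clean_pairs.
    N is ASSUMED to be uniform over all user-item pairs.
    """
    if not 0.0 <= delta <= 1.0:
        raise ValueError("delta must lie in [0, 1]")
    if not clean_pairs:
        raise ValueError("clean_pairs must be non-empty")
    # rng.random() is in [0, 1), so delta = 0 never picks noise
    # and delta = 1 always does.
    if rng.random() < delta:
        return (rng.randrange(n_users), rng.randrange(n_items))
    return rng.choice(clean_pairs)


if __name__ == "__main__":
    rng = random.Random(0)
    clean = [(0, 1), (2, 3)]  # toy interactions, not real data
    print([sample_interaction(clean, 4, 5, 0.3, rng) for _ in range(5)])
```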

Figures (7)

  • Figure 1: Double-peak: performance first rises, then falls, then rises again before a final decline. Logarithmic: performance increases logarithmically.
  • Figure 2: Scaling the embedding dimension by repeated doubling across different collaborative filtering models and datasets. Each row corresponds to a model (BPR, LightGCN, SGL, NeuMF), and each column to a dataset (Modcloth, Douban, ML-100K).
  • Figure 3: Comparison of standard BPR with the denoising variant BPR_Drop on ML-100K and Douban. BPR_Drop substantially reduces the performance degradation.
  • Figure 4: Scaling the embedding dimension by repeated doubling across different collaborative filtering models and datasets. Each row corresponds to a model (BPR, LightGCN, SGL, NeuMF), and each column to a dataset (Amazon Beauty, Amazon Baby, Amazon Books).
  • Figure 5: Scaling the embedding dimension by repeated doubling across different collaborative filtering models and datasets. Each row corresponds to a model (BPR, LightGCN, SGL, NeuMF), and each column to a dataset (Yelp, Gowalla, Pinterest).
  • ...and 2 more figures
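
The two curve shapes described for Figure 1 can be told apart by counting interior local maxima of a metric-versus-dimension curve: a double-peak curve has two, while a purely increasing (for example logarithmic) curve has none. The following is a rough sketch of that check, not a procedure taken from the paper. `count_peaks` is a hypothetical helper, and the numbers in the usage example are synthetic values chosen only to show each shape.

```python
import math


def count_peaks(values):
    """Count strict interior local maxima in a sequence of metric values.

    Plateaus (equal neighbours) are not counted as peaks.
    """
    return sum(
        1
        for i in range(1, len(values) - 1)
        if values[i - 1] < values[i] > values[i + 1]
    )


if __name__ == "__main__":
    # SYNTHETIC toy curves, not results from the paper.
    double_peak = [0.10, 0.14, 0.12, 0.15, 0.11]
    log_like = [math.log(k) for k in (8, 16, 32, 64, 128)]
    print(count_peaks(double_peak), count_peaks(log_like))
```

Measured curves are noisy, so a practical check would likely smooth the values or require a minimum drop before counting a peak.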

Theorems & Definitions (14)

  • Definition 1: Sparse Double Descent
  • Definition 2: Representation Quality
  • Theorem 1: Representation Perturbation Under Noisy Interactions
  • Remark 1
  • Lemma 1: High Gradient Sensitivity in BPR
  • Remark 2
  • Lemma 2: Noise-Induced Gradient Amplification
  • Theorem 2: Gradient Instability in NeuMF
  • Corollary 1: Comparative Noise Sensitivity: NeuMF vs. BPR
  • Definition 3: Spectral Representation of Graph Convolution
  • ...and 4 more