On the Embedding Collapse when Scaling up Recommendation Models

Xingzhuo Guo; Junwei Pan; Ximei Wang; Baixu Chen; Jie Jiang; Mingsheng Long

TL;DR

This paper identifies embedding collapse as a fundamental bottleneck when scaling up recommender models: as model size grows, the embedding matrices become nearly low-rank. It introduces Information Abundance and the Interaction-Collapse Theory to explain how feature interaction induces collapse, and shows that simply removing interaction leads to overfitting and poor scalability. To address this, the authors propose Multi-Embedding (ME), a simple design that uses multiple independently trained embedding sets, each with its own embedding-set-specific interaction module, to promote embedding diversity and mitigate collapse. Empirical results across multiple baselines and large-scale datasets show consistent scalability gains and reduced collapse with ME, and a real-world deployment yielded a substantial GMV lift. The work provides a practical blueprint for scaling recommender systems while preserving higher-order interaction knowledge.

Abstract

Recent advances in foundation models have led to a promising trend of developing large recommendation models to leverage vast amounts of available data. Still, mainstream models remain embarrassingly small, and naively enlarging them does not yield sufficient performance gains, suggesting a deficiency in model scalability. In this paper, we identify the embedding collapse phenomenon as the factor inhibiting scalability, wherein the embedding matrix tends to occupy a low-dimensional subspace. Through empirical and theoretical analysis, we demonstrate a two-sided effect of feature interaction specific to recommendation models. On the one hand, interacting with collapsed embeddings restricts embedding learning and exacerbates the collapse issue. On the other hand, interaction is crucial in mitigating the fitting of spurious features, which serves as a scalability guarantee. Based on our analysis, we propose a simple yet effective multi-embedding design incorporating embedding-set-specific interaction modules to learn embedding sets with large diversity and thus reduce collapse. Extensive experiments demonstrate that the proposed design provides consistent scalability and effective collapse mitigation for various recommendation models. Code is available at this repository: https://github.com/thuml/Multi-Embedding.
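The multi-embedding design described in the abstract can be pictured with a small forward-pass sketch. This is not the authors' implementation (see the repository above for that); it assumes that each embedding set has its own lookup tables and its own interaction module, and that the per-set outputs are averaged before a shared output layer. The class name `MultiEmbeddingSketch`, the ReLU layer standing in for a real interaction module, and all parameter names are hypothetical.

```python
import numpy as np


class MultiEmbeddingSketch:
    """Toy forward pass with several embedding sets (illustrative only)."""

    def __init__(self, cardinalities, dim, num_sets, hidden, seed=0):
        rng = np.random.default_rng(seed)
        # One list of per-field embedding tables for every embedding set.
        self.tables = [
            [rng.normal(scale=0.01, size=(c, dim)) for c in cardinalities]
            for _ in range(num_sets)
        ]
        in_dim = len(cardinalities) * dim
        # Assumption: a ReLU layer over concatenated embeddings stands in
        # for a set-specific interaction module (e.g. a cross network).
        self.interaction = [
            rng.normal(scale=0.1, size=(in_dim, hidden)) for _ in range(num_sets)
        ]
        # Shared output layer applied to the combined representation.
        self.w_out = rng.normal(scale=0.1, size=hidden)

    def forward(self, x):
        """x: integer array of shape (batch, num_fields) holding feature ids."""
        per_set = []
        for tables, W in zip(self.tables, self.interaction):
            # Look up and concatenate this set's embeddings for every field.
            e = np.concatenate([t[x[:, i]] for i, t in enumerate(tables)], axis=1)
            # Set-specific (non-linear) interaction.
            per_set.append(np.maximum(e @ W, 0.0))
        # Assumption: per-set outputs are combined by averaging.
        h = np.mean(per_set, axis=0)
        logits = h @ self.w_out
        return 1.0 / (1.0 + np.exp(-logits))  # predicted click probability
```

For example, `MultiEmbeddingSketch([10, 5, 3], dim=4, num_sets=2, hidden=8).forward(x)` maps a `(batch, 3)` array of feature ids to one probability per row. The point of the sketch is structural: scaling is done by adding embedding sets with separate interaction parameters rather than by widening a single embedding table.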

Paper Structure (52 sections, 22 equations, 17 figures, 7 tables)


Introduction
Preliminaries
Embedding Collapse
Feature Interaction Revisited
Interaction-Collapse Theory
Evidence I: Empirical analysis on models with sub-embeddings.
Evidence II: Theoretical analysis on general recommendation models.
Summary: How is collapse caused in recommendation models?
Is It Sufficient to Avoid Collapse for Scalability?
Evidence III: Limiting the modules in interaction that leads to collapse.
Evidence IV: Directly avoiding explicit interaction.
Summary: Does suppressing collapse definitely improve scalability?
Multi-Embedding Design
Multi-Embedding
Experiments
...and 37 more sections

Figures (17)

Figure 1: Unsatisfactory scalability of existing recommendation models. (a): Increasing the embedding size does not remarkably improve, and can even hurt, model performance. (b): Most embedding matrices do not learn large singular values and tend to be low-rank.
Figure 2: Visualization of information abundance on the Criteo dataset. Fields are sorted by their cardinalities.
Figure 3: Information abundance of sub-embedding matrices for DCNv2, with field indices sorted by information abundance of corresponding raw embedding matrices. Higher or warmer indicates larger values. It is observed that $\mathrm{IA}({\bm{E}}_i^{\to j})$ is co-influenced by both $\mathrm{IA}({\bm{E}}_i)$ and $\mathrm{IA}({\bm{E}}_j)$.
Figure 4: $\mathrm{IA}({\bm{E}}_1)$ for toy experiments. "Small" and "Large" refer to the cardinality of $\mathcal{X}_3$.
Figure 5: Experimental results of Evidence III. Restricting DCNv2 leads to higher information abundance, yet the model suffers from over-fitting, thus resulting in non-scalability.
...and 12 more figures

Theorems & Definitions (1)

Definition 3.1: Information Abundance
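
Definition 3.1 is listed here without its statement. As a rough illustration, the sketch below assumes that the information abundance of an embedding matrix is the sum of its singular values divided by its largest singular value, so that a matrix collapsed onto one direction scores close to 1 and a matrix with a balanced spectrum scores close to its rank. That formula is an assumption not stated on this page, and the function name `information_abundance` is hypothetical.

```python
import numpy as np


def information_abundance(E):
    """Assumed definition: sum of singular values over the largest one."""
    # Singular values are returned in descending order.
    s = np.linalg.svd(np.asarray(E, dtype=float), compute_uv=False)
    if s[0] == 0.0:
        return 0.0  # all-zero matrix: no information by convention here
    return float(s.sum() / s[0])


# A balanced matrix (identity) versus a rank-one, fully collapsed matrix.
balanced = np.eye(4)
collapsed = np.outer(np.arange(1.0, 6.0), np.array([1.0, 2.0, 3.0, 4.0]))
print(information_abundance(balanced), information_abundance(collapsed))
```

Under this assumed definition, a lower value for a given embedding dimension indicates that the matrix occupies fewer effective directions, which is the low-rank tendency described in Figure 1(b).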
