Adaptive Regularization for Large-Scale Sparse Feature Embedding Models

Mang Li, Wei Lyu

TL;DR

The paper tackles the one-epoch overfitting problem observed in CTR/CVR estimation models that rely on large-scale sparse categorical features. It gives a theoretical explanation via a Rademacher-complexity bound showing that embedding-layer norms and update frequencies drive generalization, and then proposes an adaptive regularization method (AdamAR/AdagradAR) that allocates a norm budget to each embedding value according to its update interval. The method computes a per-embedding coefficient $\lambda_{ij}=\min(1, \alpha I_{ij})$ from the local update interval $I_{ij}$ and integrates it into decoupled weight-decay optimizers, reducing overfitting and improving performance within a single epoch. Empirical results across datasets (iPinYou, Avazu, Amazon, LZD) and multiple backbones (DNN, WDL, DeepFM, WuKong) show consistent AUC improvements and smaller embedding norms, and the method has been deployed in production systems. The work offers a practical regularizer for high-cardinality sparse features and a concrete mechanism for balancing model fit against generalization in industrial search, advertising, and recommendation applications.
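The coefficient $\lambda_{ij}=\min(1, \alpha I_{ij})$ described above can be illustrated with a minimal NumPy sketch. This is not the authors' AdamAR/AdagradAR implementation: it swaps in plain SGD with decoupled weight decay for brevity, and it assumes that $I_{ij}$ is the number of training steps since the embedding row was last updated and that $\lambda_{ij}$ simply scales the decay term. The function names `adaptive_coeff` and `sparse_step` and all parameter names are hypothetical.

```python
import numpy as np


def adaptive_coeff(interval, alpha):
    """Per-embedding coefficient lambda = min(1, alpha * I), elementwise."""
    return np.minimum(1.0, alpha * np.asarray(interval, dtype=float))


def sparse_step(emb, grads, rows, last_step, step, lr, weight_decay, alpha):
    """One sparse update with interval-scaled decoupled weight decay.

    Assumptions (not taken from the paper):
      * plain SGD stands in for Adam/Adagrad,
      * the interval is `step - last_step[row]`,
      * `rows` contains no duplicate indices.
    """
    rows = np.asarray(rows)
    interval = step - last_step[rows]
    lam = adaptive_coeff(interval, alpha)

    # Decoupled decay: shrink the touched rows directly, independently of
    # the gradient. Rows that were idle longer get a larger coefficient,
    # capped at 1.
    emb[rows] *= (1.0 - lr * weight_decay * lam)[:, None]

    # Ordinary gradient step on the rows present in this batch.
    emb[rows] -= lr * grads

    # Remember when each touched row was last updated.
    last_step[rows] = step
    return lam
```

In this sketch only rows that appear in the current batch are decayed, so a frequently updated embedding receives many small decays while a rare one receives a single, larger (but capped) decay when it next appears. Whether the paper applies the coefficient in exactly this way is not stated in the summary above.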

Abstract

The one-epoch overfitting problem has drawn widespread attention, especially in CTR and CVR estimation models in search, advertising, and recommendation domains. These models, which rely heavily on large-scale sparse categorical features, often suffer a significant decline in performance when trained for multiple epochs. Although recent studies have proposed heuristic solutions, they have not clearly identified the fundamental cause of this phenomenon. In this work, we provide a theoretical analysis that explains why overfitting occurs in models that use large-scale sparse categorical features. Based on this analysis, we propose an adaptive regularization method to address it. Our approach not only prevents the severe performance degradation observed during multi-epoch training, but also improves model performance within a single epoch. This method has already been deployed in online production systems.

Paper Structure

This paper contains 21 sections, 2 theorems, 14 equations, 3 figures, 4 tables, 2 algorithms.

Key Result

Proposition 1

A necessary condition for the optimal regularization multiplier $\lambda^*_{ij}$ associated with the constraint $\|{\bm{e}}_{ij}\|^2 \leq \tau^*_{ij}$ is $\lambda^*_{ij}=\mu_0/m_{ij}$, where $\mu_0$ is the Lagrange multiplier corresponding to the budget constraint $\sum_{i=1}^S\sum_{j=1}^{N_i} \tau^*_{ij} \leq C$.

Figures (3)

  • Figure 1: Performance of four methods on Avazu dataset with DNN backbone. (a) shows the training loss curves. (b) presents the test AUC. (c) illustrates the cumulative $\ell_2$ norm of embedding vectors.
  • Figure 2: Performance comparison using various filter ratios for the "IP" feature on the iPinYou dataset. (a) shows the test AUC results. (b) presents the cumulative $\ell_2$ norm of embedding vectors.
  • Figure 3: Performance with different weight decay coefficients on the Avazu dataset with DNN backbone at the end of epoch 2. (a) shows the performance with Adam. (b) shows the performance with Adagrad.

Theorems & Definitions (2)

  • Proposition 1
  • Proposition 2