Adaptive Regularization for Large-Scale Sparse Feature Embedding Models
Mang Li, Wei Lyu
TL;DR
The paper tackles the one-epoch overfitting observed in CTR/CVR estimation models that rely on large-scale sparse categorical features. It provides a theoretical explanation via a Rademacher-complexity bound showing that embedding-layer norms and update frequencies drive generalization, and then proposes an adaptive regularization method (AdamAR/AdagradAR) that allocates a norm budget to each embedding according to its update interval. The method computes a per-embedding coefficient $\lambda_{ij}=\min(1, \alpha I_{ij})$ from the local update interval $I_{ij}$ and integrates it into decoupled weight-decay optimizers, reducing overfitting and improving performance within a single epoch. Empirical results on four datasets (iPinYou, Avazu, Amazon, LZD) and multiple backbones (DNN, WDL, DeepFM, WuKong) show consistent AUC improvements and smaller embedding norms, and the method has been deployed in production systems. The work advances practical regularization for high-cardinality sparse features and provides a concrete mechanism for balancing model fit and generalization in industrial search, advertising, and recommendation applications.
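To make the coefficient $\lambda_{ij}=\min(1, \alpha I_{ij})$ concrete, the following is a minimal sketch rather than the authors' AdamAR/AdagradAR implementation. It assumes that the coefficient scales the strength of a decoupled weight-decay step applied to an embedding row when that row is touched, and that $I_{ij}$ is measured in optimizer steps since the row's last update. The names `adaptive_coefficient`, `RowDecay`, and `base_decay` are hypothetical and do not come from the paper.

```python
def adaptive_coefficient(interval, alpha):
    """Return lambda = min(1, alpha * I) for an update interval I."""
    return min(1.0, alpha * interval)


class RowDecay:
    """Hypothetical helper: per-row decoupled decay scaled by the update interval.

    Assumption: rows that were last updated long ago (large I) receive a
    stronger decay, capped at base_decay once alpha * I reaches 1.
    """

    def __init__(self, alpha, base_decay):
        self.alpha = alpha            # slope of the coefficient in the interval
        self.base_decay = base_decay  # maximum decay applied in one step
        self.last_step = {}           # row id -> step of the last update

    def apply(self, table, row_id, step):
        # A row seen for the first time has interval 0, hence no decay.
        interval = step - self.last_step.get(row_id, step)
        lam = adaptive_coefficient(interval, self.alpha)
        # Decoupled decay: shrink the weights directly, separately from
        # the gradient-based update that the optimizer would perform.
        scale = 1.0 - self.base_decay * lam
        table[row_id] = [w * scale for w in table[row_id]]
        self.last_step[row_id] = step
        return lam


# Usage: one embedding row, touched at steps 10 and 12.
table = {7: [1.0, -2.0]}
decay = RowDecay(alpha=0.25, base_decay=0.5)
decay.apply(table, 7, step=10)  # first touch, interval 0
decay.apply(table, 7, step=12)  # interval 2, lambda = 0.5
```

In this sketch the decay is applied lazily, only when a row appears in a batch, so the cost per step stays proportional to the number of touched rows rather than to the size of the embedding table.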
Abstract
The one-epoch overfitting problem has drawn widespread attention, especially in CTR and CVR estimation models in the search, advertising, and recommendation domains. These models, which rely heavily on large-scale sparse categorical features, often suffer a significant decline in performance when trained for multiple epochs. Although recent studies have proposed heuristic solutions, they have not clearly identified the fundamental cause of this phenomenon. In this work, we provide a theoretical analysis that explains why overfitting occurs in models that use large-scale sparse categorical features. Based on this analysis, we propose an adaptive regularization method to address it. Our approach not only prevents the severe performance degradation observed during multi-epoch training, but also improves model performance within a single epoch. This method has already been deployed in online production systems.
