Adaptive Optimizers with Sparse Group Lasso for Neural Networks in CTR Prediction

Yun Yue, Yongchao Liu, Suo Tong, Minghao Li, Zhen Zhang, Chunyang Wen, Huanjun Bao, Lihong Gu, Jinjie Gu, Yixiang Mu

TL;DR

This work introduces a framework that integrates sparse group lasso regularization into a broad family of adaptive optimizers for neural networks, producing sparse DNNs directly during training without a post-processing step. It derives a closed-form ada-group-lasso update and shows how Group Adam and Group Adagrad extend existing optimizers while reducing to their baseline behavior when the regularizers are zero. The authors establish regret bounds and convergence rates in the online convex optimization setting, proving $O(\sqrt{T})$ regret under suitable conditions. Empirically, on three large CTR datasets the proposed Group variants outperform their vanilla counterparts combined with magnitude pruning at the same sparsity level, and they reach high sparsity with competitive or improved AUC relative to the unpruned models. The code is publicly available, and the work offers practical insights on hyperparameter choices and embedding-dimension effects in sparse CTR models.
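For reference, regret in online convex optimization is commonly defined as below. The symbols $f_t$ (the loss revealed at round $t$), $x_t$ (the iterate played at round $t$), and $\mathcal{X}$ (the feasible set) are notation assumed here for illustration; the paper's own definition may additionally include the regularization terms.

$$R(T) = \sum_{t=1}^{T} f_t(x_t) - \min_{x \in \mathcal{X}} \sum_{t=1}^{T} f_t(x)$$

A bound of $R(T) = O(\sqrt{T})$ implies that the average regret $R(T)/T$ is $O(1/\sqrt{T})$, which vanishes as $T \to \infty$.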

Abstract

We develop a novel framework that adds the regularizers of the sparse group lasso to a family of adaptive optimizers in deep learning, such as Momentum, Adagrad, Adam, AMSGrad, and AdaHessian, and create a new class of optimizers named Group Momentum, Group Adagrad, Group Adam, Group AMSGrad, Group AdaHessian, etc., accordingly. We establish theoretically proven convergence guarantees in the stochastic convex setting, based on primal-dual methods. We evaluate the regularization effect of our new optimizers on three large-scale real-world ad click datasets with state-of-the-art deep learning models. The experimental results reveal that, compared with the original optimizers followed by a magnitude-pruning post-processing step, our methods significantly improve model performance at the same sparsity level. Furthermore, compared with the original optimizers without magnitude pruning, our methods achieve extremely high sparsity with significantly better or highly competitive performance. The code is available at https://github.com/intelligent-machine-learning/tfplus/tree/main/tfplus.
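To make the magnitude-pruning baseline mentioned in the abstract concrete, here is a minimal sketch of group-wise magnitude pruning on an embedding table. It assumes each row is one feature's embedding vector and that rows are ranked by L2 norm; the function name `prune_embedding_rows` and the `keep_fraction` parameter are hypothetical, and the paper's exact pruning criterion may differ.

```python
import numpy as np

def prune_embedding_rows(embeddings: np.ndarray, keep_fraction: float) -> np.ndarray:
    """Zero out the rows with the smallest L2 norm, keeping about `keep_fraction` of them.

    Assumption: one row per feature; ranking by row norm is an illustrative choice.
    """
    if not 0.0 <= keep_fraction <= 1.0:
        raise ValueError("keep_fraction must lie in [0, 1]")
    num_rows = embeddings.shape[0]
    num_keep = int(round(keep_fraction * num_rows))
    norms = np.linalg.norm(embeddings, axis=1)
    # argsort is ascending, so the last `num_keep` indices have the largest norms.
    keep_idx = np.argsort(norms)[num_rows - num_keep:]
    pruned = np.zeros_like(embeddings)
    pruned[keep_idx] = embeddings[keep_idx]
    return pruned

# Toy table with 4 features of dimension 2; keep the 2 rows with the largest norms.
emb = np.array([[3.0, 4.0], [0.1, 0.0], [0.0, 2.0], [1.0, 1.0]])
pruned = prune_embedding_rows(emb, keep_fraction=0.5)
```

Pruning of this kind is applied after training, whereas the Group optimizers described above induce the zeros during training.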

Paper Structure

This paper contains 26 sections, 7 theorems, 46 equations, 5 figures, 9 tables, 3 algorithms.

Key Result

Theorem

Given $A_t = \sum_{s=1}^{t}\frac{Q_s^g}{2\alpha_{s}} + \lambda_{2}\mathbb{I}$ of Eq. (eq:sgl), $z_t = z_{t-1} + m_t - \frac{Q_t}{\alpha_t}x_t$ at each iteration $t = 1, \dots, T$, and $z_0 = \mathbf{0}$, the optimal solution of Eq. (eq:ada-sgl) is updated in closed form. The update is stated in terms of a vector $s_t$ that is defined element-wise, a second vector $\tilde{s}_t$, and the matrix $\sum_{s=1}^{t} \frac{Q_s}{\alpha_{s}}$; the displayed formulas themselves are missing from this extract.
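As background for this kind of closed-form update, the sketch below implements the standard proximal operator of the sparse group lasso penalty for a single group: element-wise soft-thresholding followed by a group-level shrinkage. It is not the theorem's exact update, which also involves the accumulated quantities $A_t$ and $z_t$. The names `soft_threshold`, `sparse_group_lasso_prox`, `lam1`, and `lam21` are hypothetical, and any group-size scaling of the group penalty is assumed to be folded into `lam21`.

```python
import numpy as np

def soft_threshold(v: np.ndarray, lam1: float) -> np.ndarray:
    """Element-wise soft-thresholding, the proximal operator of lam1 * ||v||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - lam1, 0.0)

def sparse_group_lasso_prox(v: np.ndarray, lam1: float, lam21: float) -> np.ndarray:
    """Proximal operator of lam1 * ||v||_1 + lam21 * ||v||_2 for one group `v`."""
    s = soft_threshold(v, lam1)
    norm = np.linalg.norm(s)
    if norm == 0.0:
        # Every entry was thresholded away, so the whole group is zero.
        return np.zeros_like(s)
    # Shrink the group toward zero; it is dropped entirely when norm <= lam21.
    scale = max(1.0 - lam21 / norm, 0.0)
    return scale * s

group = np.array([3.0, -0.5, 1.0])
shrunk = sparse_group_lasso_prox(group, lam1=1.0, lam21=1.0)   # group survives, shrunk
dropped = sparse_group_lasso_prox(group, lam1=1.0, lam21=2.0)  # whole group zeroed
unchanged = sparse_group_lasso_prox(group, lam1=0.0, lam21=0.0)
```

With both coefficients set to zero the operator is the identity, which mirrors the statement that the Group optimizers keep their baseline behavior when the regularizers are zero.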

Figures (5)

  • Figure 1: AUC at different sparsity levels for two optimizers on the three datasets. MLP, OPNN and DCN are in the left, middle, and right columns, respectively. The x-axis is sparsity (the number of features whose embedding vectors are not equal to $0$, divided by the total number of features present in the training data). The y-axis is AUC. Error bars represent one standard deviation.
  • Figure 2: AUC at different sparsity levels (feature rate) for two methods. The legend distinguishes the algorithms using $s_t$ and $\tilde{s}_t$. The x-axis is sparsity. The y-axis is AUC.
  • Figure 3: Sparsity (feature rate) for different values of the regularization terms. The legend identifies the regularization terms. The x-axis is the value of the regularization term. The y-axis is sparsity.
  • Figure : Group Adam
  • Figure : The regularization terms of Group Adam on the three datasets.
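The sparsity measure (feature rate) defined in the Figure 1 caption can be sketched as follows. This assumes the embedding table has one row per feature present in the training data; the function name `feature_rate` is hypothetical.

```python
import numpy as np

def feature_rate(embeddings: np.ndarray) -> float:
    """Fraction of features whose embedding vector is not identically zero.

    Assumption: each row of `embeddings` is one feature seen in training.
    """
    nonzero_rows = np.any(embeddings != 0, axis=1)
    return float(nonzero_rows.sum() / embeddings.shape[0])

# Two of the four features have a nonzero embedding vector.
table = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 0.0], [0.0, 0.5]])
rate = feature_rate(table)
```

Under this definition a lower value means a sparser model, since fewer features retain a nonzero embedding.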

Theorems & Definitions (12)

  • Theorem
  • Theorem
  • Theorem
  • Lemma
  • Lemma
  • Corollary
  • Corollary
  • Proof
  • Proof
  • Proof
  • ...and 2 more