Scaled Supervision is an Implicit Lipschitz Regularizer

Zhongyu Ouyang, Chunhui Zhang, Yaning Jia, Soroush Vosoughi

TL;DR

The paper addresses overfitting in CTR models caused by label thresholding and rapidly changing online contexts. It proposes using scaled, fine-grained supervision (ratings instead of binary labels) as an implicit Lipschitz regularizer and derives the bound $L_p(N) \le L_f / \sqrt{N}$ for temperature-scaled softmax outputs. Empirically, the approach improves predictive and ranking performance across multiple baselines and datasets, with minimal added latency and little hyperparameter tuning. The work shows that increasing the supervision bandwidth improves stability and generalization in CTR models, with implications for robust recommender systems and related architectures.

Abstract

In modern social media, recommender systems (RecSys) rely on the click-through rate (CTR) as the standard metric to evaluate user engagement. CTR prediction is traditionally framed as a binary classification task to predict whether a user will interact with a given item. However, this approach overlooks the complexity of real-world social modeling, where the user, item, and their interactive features change dynamically in fast-paced online environments. This dynamic nature often leads to model instability, reflected in overfitting to short-term fluctuations rather than higher-level interactive patterns. While overfitting calls for more scaled and refined supervision, current solutions often rely on binary labels that oversimplify fine-grained user preferences through the thresholding process, which significantly reduces the richness of the supervision. Therefore, we aim to alleviate the overfitting problem by increasing the supervision bandwidth in CTR training. Specifically, (i) theoretically, we formulate the impact of fine-grained preferences on model stability as a Lipschitz constraint; (ii) empirically, we discover that scaling the supervision bandwidth can act as an implicit Lipschitz regularizer, stably optimizing existing CTR models to achieve better generalizability. Extensive experiments show that this scaled supervision significantly and consistently improves the optimization process and the performance of existing CTR models, even without the need for additional hyperparameter tuning.

Paper Structure

This paper contains 15 sections, 1 theorem, 28 equations, 5 figures, 4 tables.

Key Result

Theorem 1

Let $\mathbf{f}(\mathbf{x}) = [f_1(\mathbf{x}), \dots, f_N(\mathbf{x})]$ be the logits output by a model for an input $\mathbf{x} \in \mathcal{X} \subset \mathbb{R}^D$, where $N$ is the dimension of the output logits. Let $\mathbf{p}(\mathbf{x}) = \sigma(\mathbf{f}(\mathbf{x}))$ denote the correspon

Figures (5)

  • Figure 1: (a) and (b) show the existing paradigm and our modified paradigm, respectively, for training CTR models with explicit feedback. The input includes user and item IDs and the contextual information of the interaction between them. Our paradigm enlarges the supervision bandwidth from explicit preferences: we slightly modify existing CTR models' prediction layers to recover the explicit preferences for more refined supervision. These logits are then normalized via the Softmax function into probabilities that match the CTR prediction task.
  • Figure 2: The L2 gradient norms during training and the post-training weight norms of the learned user and item embeddings for the original DCNV2, AutoInt, and FiGNN methods and for the same methods after applying our approach.
  • Figure 3: The changes in AUC and logloss when perturbations are applied to the input embeddings of the original CTR methods and those after applying our method. Left y-axis represents AUC ($\uparrow$) and right y-axis corresponds to logloss ($\downarrow$).
  • Figure 4: The AUC scores of DCNV2 (Wang et al., 2021), WideDeep (Cheng et al., 2016), and xDeepFM (Lian et al., 2018) on the ML-1M, Yelp2018, and Amazon-book datasets with different ratios $\lambda_r$ in training.
  • Figure 5: A concrete example of improved end-user experience in recommendation. Compared with the recommendation list generated by the original model (left side), our model, with improved generalizability, generates recommended items (right side) that are both accurate (i.e., the contents are preferred) and personalized (i.e., items of greater interest are ranked toward the front).
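
The Figure 1 caption describes a prediction layer that emits one logit per rating level and normalizes the logits with Softmax so the result can serve the CTR task. The following is a minimal NumPy sketch of that idea, not the authors' implementation. The function names (`click_probability`, `rating_cross_entropy`), the `click_threshold` parameter, and the rule of summing the probabilities of ratings at or above the threshold are all assumptions made for illustration; the source does not specify how rating probabilities are mapped to a click probability.

```python
import numpy as np


def softmax(logits, temperature=1.0):
    """Numerically stable, temperature-scaled softmax over the last axis."""
    z = logits / temperature
    z = z - z.max(axis=-1, keepdims=True)  # shift for stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def click_probability(rating_logits, click_threshold, temperature=1.0):
    """Collapse N rating probabilities into one click probability.

    rating_logits: shape (batch, N), one logit per rating level 1..N.
    click_threshold: ratings >= this value count as a click
                     (hypothetical rule, assumed for this sketch).
    """
    probs = softmax(rating_logits, temperature)
    # Column k holds rating k + 1, so rating `click_threshold`
    # starts at column index click_threshold - 1.
    return probs[:, click_threshold - 1:].sum(axis=-1)


def rating_cross_entropy(rating_logits, ratings, temperature=1.0):
    """Mean cross-entropy against integer ratings in 1..N (fine-grained labels)."""
    probs = softmax(rating_logits, temperature)
    picked = probs[np.arange(len(ratings)), ratings - 1]
    return float(-np.log(picked + 1e-12).mean())


# Toy batch: 4 interactions, 5 rating levels, random logits.
rng = np.random.default_rng(0)
logits = rng.normal(size=(4, 5))
ratings = np.array([5, 1, 4, 2])

p_click = click_probability(logits, click_threshold=4)
loss = rating_cross_entropy(logits, ratings)
```

In this sketch the model is trained against the N-way rating labels, while the scalar `p_click` is what would be scored with AUC or logloss, so the binary CTR evaluation stays unchanged.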

Theorems & Definitions (3)

  • Theorem 1
  • Proof sketch
  • Proof