Efficient Hyperparameter Search for Non-Stationary Model Training

Berivan Isik, Matthew Fahrbach, Dima Kuzmin, Nicolas Mayoraz, Emil Praun, Steffen Rendle, Raghavendra Vasudeva

TL;DR

This work addresses the prohibitive cost of hyperparameter search for online, non-stationary learning in recommender and advertising systems. It introduces a two-stage paradigm that first identifies promising configurations cheaply and then trains only the top candidates to their full potential. The first stage relies on data reduction (stopping rules, sub-sampling) and prediction strategies (constant, trajectory, stratified) designed for distribution shift. The approach is evaluated on the large-scale Criteo 1TB benchmark (up to 10x lower search cost) and on a real-world industrial advertising system (up to 2x cost savings). By generalizing the Successive Halving framework to non-stationary settings and exploiting relative performance across configurations, the method enables hyperparameter search at web scale with robust top-k ranking.

Abstract

Online learning is the cornerstone of applications like recommendation and advertising systems, where models continuously adapt to shifting data distributions. Model training for such systems is remarkably expensive, a cost that multiplies during hyperparameter search. We introduce a two-stage paradigm to reduce this cost: (1) efficiently identifying the most promising configurations, and then (2) training only these selected candidates to their full potential. Our core insight is that focusing on accurate identification in the first stage, rather than achieving peak performance, allows for aggressive cost-saving measures. We develop novel data reduction and prediction strategies that specifically overcome the challenges of sequential, non-stationary data not addressed by conventional hyperparameter optimization. We validate our framework's effectiveness through a dual evaluation: first on the Criteo 1TB dataset, the largest suitable public benchmark, and second on an industrial advertising system operating at a scale two orders of magnitude larger. Our methods reduce the total hyperparameter search cost by up to 10$\times$ on the public benchmark and deliver significant, validated efficiency gains in the industrial setting.
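
The two-stage paradigm described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function `two_stage_search`, the callables `cheap_score` and `full_train`, and the toy loss values are all hypothetical names and numbers introduced here, and the sketch assumes that lower loss is better and that the cheap first-stage estimate only needs to rank configurations correctly, not match their final loss.

```python
from typing import Callable, Dict, List


def two_stage_search(
    configs: List[str],
    cheap_score: Callable[[str], float],
    full_train: Callable[[str], float],
    k: int,
) -> Dict[str, float]:
    """Rank all configs with a cheap estimate, then fully train only the top k."""
    # Stage 1: a low-cost estimate of each configuration's final loss
    # (for example, from a run on reduced data). Hypothetical stand-in.
    predicted = {c: cheap_score(c) for c in configs}
    # Keep the k configurations with the lowest predicted loss.
    top_k = sorted(configs, key=lambda c: predicted[c])[:k]
    # Stage 2: spend the full training budget only on the selected candidates.
    return {c: full_train(c) for c in top_k}


# Toy values (made up): the cheap estimate is biased upward but keeps the ranking.
TRUE_LOSS = {"a": 0.50, "b": 0.42, "c": 0.47, "d": 0.44}
CHEAP_ESTIMATE = {"a": 0.61, "b": 0.55, "c": 0.57, "d": 0.56}

result = two_stage_search(
    configs=list(TRUE_LOSS),
    cheap_score=lambda c: CHEAP_ESTIMATE[c],
    full_train=lambda c: TRUE_LOSS[c],
    k=2,
)
print(result)
```

In this toy setup the cheap estimates are far from the true losses in absolute terms, yet the top-2 set is still recovered, which is the property the first stage needs.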

Paper Structure

This paper contains 56 sections, 2 equations, 11 figures, 1 table.

Figures (11)

  • Figure 1: Cluster sizes vary widely over the 24-day training window.
  • Figure 2: (left) The effect of time variation in sequential non-stationary data on the loss during training. Different configurations (varying architecture, model size, and other hyperparameters) are all trained on the same 24-day Criteo data and follow the same time-variation pattern. The time variation (e.g. blue line) is significantly larger than the difference between configurations (e.g. green line). (right) Relative loss with respect to a reference configuration. We choose Configuration 5 from (left) as the reference run and plot the other configurations' losses relative to it.
  • Figure 3: Our proposal (performance-based stopping with stratified prediction on sub-sampled data) in comparison with baselines, (1) basic early stopping and (2) basic sub-sampling. For our proposal, the sub-sampling is for negative-labeled examples only at a fixed rate of $0.5$. For each curve, we vary certain parameters to obtain $\mathop{\mathtt{regret}} @3$ at different $C$, e.g. $\mathcal{T}_{\text{stop}}$ for performance-based stopping (blue), $t_{\text{stop}}$ for basic early stopping (green), and $\lambda^{\text{uniform}}$ for basic sub-sampling (orange).
  • Figure 4: Comparison of one-shot early stopping and performance-based stopping, when used with (left) constant, (center) trajectory, and (right) stratified prediction.
  • Figure 5: Comparison of prediction strategies.
  • ...and 6 more figures
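
The relative-loss view in the Figure 2 caption can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the paper's code: "relative loss" is taken here to mean the per-step difference from the reference run (the caption does not say whether a difference or a ratio is used), and the function name `relative_loss` and the toy curves are hypothetical.

```python
from typing import Dict, List


def relative_loss(
    losses: Dict[str, List[float]], reference: str
) -> Dict[str, List[float]]:
    """Subtract the reference run's loss from every other run, step by step."""
    ref_curve = losses[reference]
    return {
        name: [value - ref for value, ref in zip(curve, ref_curve)]
        for name, curve in losses.items()
        if name != reference
    }


# Toy curves (made up): every run shares the same time pattern [1.0, 0.5, 0.75]
# plus a constant per-configuration offset.
losses = {
    "ref": [1.0, 0.5, 0.75],
    "cfg_a": [1.125, 0.625, 0.875],
    "cfg_b": [1.25, 0.75, 1.0],
}

rel = relative_loss(losses, "ref")
print(rel)
```

Because the shared time pattern cancels in the subtraction, the relative curves in this toy example are flat, which makes the gap between configurations easy to read even when the raw losses swing over time.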