Hyperparameter Optimization Can Even be Harmful in Off-Policy Learning and How to Deal with It

Yuta Saito, Masahiro Nomura

TL;DR

The paper tackles the risk that hyperparameter optimization (HPO) can hurt off-policy learning (OPL) when the surrogate objective used for validation overestimates generalization performance. It combines empirical and theoretical analyses to demonstrate optimistic bias and unsafe behavior in standard HPO with off-policy estimators such as inverse propensity scoring (IPS), and then introduces CIR-HPO, a simple method that combines a Conservative Surrogate Objective (CSO) with Adaptive Imitation Regularization (AIR). CIR-HPO reduces overestimation-driven regret and prevents degradation below the logging policy, delivering more robust generalization in experiments on both synthetic data and the real-world Open Bandit Dataset. The work provides practical guidance for safe HPO in OPL and contributes to more reliable deployment of learned policies in applications such as recommender systems and personalized medicine.

Abstract

There has been growing interest in off-policy evaluation in areas such as recommender systems and personalized medicine. Significant progress has been made in developing estimators that accurately estimate the effectiveness of counterfactual policies from biased logged data. However, in many cases those estimators are used not only to evaluate the value of decision-making policies but also to search for the best hyperparameters in a large candidate space. This work explores the latter task, hyperparameter optimization (HPO) for off-policy learning. We empirically show that naively applying an unbiased estimator of the generalization performance as a surrogate objective in HPO can cause an unexpected failure, merely pursuing hyperparameters whose generalization performance is greatly overestimated. We then propose simple and computationally efficient corrections to the typical HPO procedure that address these issues simultaneously. Empirical investigations demonstrate the effectiveness of our proposed HPO algorithm in situations where the typical procedure fails severely.
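The failure mode described in the abstract can be illustrated with a small Monte Carlo sketch. This is not the paper's experimental setup: the candidate values, the Gaussian noise model, and every name below (`simulate_optimistic_bias`, `noise_sd`, and so on) are assumptions made purely for illustration. Each candidate hyperparameter gets an unbiased but noisy estimate of its true value, and the candidate with the highest estimate is selected, as a naive HPO procedure would do.

```python
import random
import statistics


def simulate_optimistic_bias(n_candidates=20, noise_sd=1.0, n_trials=2000, seed=0):
    """Select the candidate with the highest unbiased-but-noisy estimate, many times."""
    rng = random.Random(seed)
    # Hypothetical true values V(theta), one per candidate hyperparameter.
    true_values = [rng.uniform(0.0, 1.0) for _ in range(n_candidates)]

    selected_estimates = []
    selected_true = []
    for _ in range(n_trials):
        # Unbiased estimate: true value plus zero-mean Gaussian noise (an assumption).
        estimates = [v + rng.gauss(0.0, noise_sd) for v in true_values]
        # Naive HPO step: keep the candidate whose estimate is largest.
        best = max(range(n_candidates), key=estimates.__getitem__)
        selected_estimates.append(estimates[best])
        selected_true.append(true_values[best])

    return {
        "mean_selected_estimate": statistics.fmean(selected_estimates),
        "mean_selected_true_value": statistics.fmean(selected_true),
        "best_true_value": max(true_values),
    }


result = simulate_optimistic_bias()
gap = result["mean_selected_estimate"] - result["mean_selected_true_value"]
print(f"mean estimate of the selected candidate: {result['mean_selected_estimate']:.3f}")
print(f"mean true value of the selected candidate: {result['mean_selected_true_value']:.3f}")
print(f"approximate optimistic bias: {gap:.3f}")
```

Although every individual estimate is unbiased, the estimate attached to the selected candidate is systematically too high, because selection favors candidates whose noise happened to be positive. In this toy setting the gap grows with the number of candidates and with the noise level.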

Paper Structure

This paper contains 32 sections, 3 theorems, 25 equations, 11 figures, 3 tables, 2 algorithms.

Key Result

Proposition 3.1

Given that $\hat{V}$ is unbiased, we have the following inequalities, where $\mathbb{E}_{\mathcal{D}}[\cdot]$ takes the expectation over all randomness in the logged data $\mathcal{D}$, and $\mathbb{E}_{\mathcal{D}} [\hat{V} (\hat{\theta}(\mathcal{D}); \mathcal{D}) ] - \mathbb{E}_{\mathcal{D}} [ V ( \hat{\theta}(\mathcal{D}) ) ]$ is the amount of optimistic bias.
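To see why the optimistic bias defined above cannot be negative, the standard argument can be sketched as follows. This is a generic derivation, not necessarily the exact statement of the proposition. It assumes that $\hat{\theta}(\mathcal{D})$ maximizes $\hat{V}(\theta; \mathcal{D})$ over a candidate set $\Theta$ and that $\theta^{*}$ maximizes the true value $V(\theta)$ over the same set; the symbols $\Theta$ and $\theta^{*}$ are notation introduced here for illustration.

```latex
\begin{align*}
\mathbb{E}_{\mathcal{D}} \big[ \hat{V} ( \hat{\theta}(\mathcal{D}); \mathcal{D} ) \big]
  &= \mathbb{E}_{\mathcal{D}} \Big[ \max_{\theta \in \Theta} \hat{V} ( \theta; \mathcal{D} ) \Big] \\
  &\ge \max_{\theta \in \Theta} \mathbb{E}_{\mathcal{D}} \big[ \hat{V} ( \theta; \mathcal{D} ) \big]
   && \text{(expectation of a max is at least the max of expectations)} \\
  &= \max_{\theta \in \Theta} V ( \theta ) = V ( \theta^{*} )
   && \text{(unbiasedness of } \hat{V} \text{ for each fixed } \theta \text{)} \\
  &\ge \mathbb{E}_{\mathcal{D}} \big[ V ( \hat{\theta}(\mathcal{D}) ) \big]
   && \text{(} V(\theta^{*}) \ge V(\theta) \text{ for every } \theta \in \Theta \text{)}
\end{align*}
```

Subtracting the last line from the first shows that, under these assumptions, $\mathbb{E}_{\mathcal{D}} [\hat{V} (\hat{\theta}(\mathcal{D}); \mathcal{D}) ] - \mathbb{E}_{\mathcal{D}} [ V ( \hat{\theta}(\mathcal{D}) ) ] \ge 0$.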

Figures (11)

  • Figure 1: Empirical Evidence of Optimistic Bias and Unsafe Behavior in HPO for OPL (w/ TPE). The results are averaged over 25 runs with different seeds and then normalized by $V(\pi_0)$. The shaded regions indicate 95% confidence intervals.
  • Figure 2: Distributions of Overestimation Bias ($\beta_0=3$)
  • Figure 3: Comparing CIR-HPO (our proposal) and Baseline by their generalization performance. The results are averaged over 25 runs with different seeds and then normalized by $V(\pi_0)$. The shaded regions indicate 95% confidence intervals.
  • Figure 4: Behavior of adaptive regularization parameter ($\alpha_t$) of CIR-HPO with varying values of $\beta_0 \in \{-3,0,3,10,20\}$.
  • Figure 5: Sensitivity of the generalization performance of CIR-HPO regarding the choice of $\delta$.
  • ...and 6 more figures

Theorems & Definitions (6)

  • Proposition 3.1
  • Proposition 3.2
  • proof
  • proof
  • Proposition D.1
  • proof