Incorporating Surrogate Gradient Norm to Improve Offline Optimization Techniques

Manh Cuong Dao, Phi Le Nguyen, Thao Nguyen Truong, Trong Nghia Hoang

TL;DR

The paper tackles offline optimization, in which a surrogate is learned from fixed data and then optimized over its inputs, a setting plagued by out-of-distribution errors. It introduces generalized surrogate sharpness and derives a tractable gradient-norm proxy for it, which is used to regularize surrogate training. This yields the IGNITE family of constrained optimizers, which empirically improve performance across diverse design tasks. Theoretical results bound the sharpness on unseen data by the empirical sharpness, providing PAC-Bayes-style guarantees that underpin the regularizer. Empirically, IGNITE and its variant achieve gains of up to 9.6% over strong baselines, with broad improvements and reasonable compute overhead, and open-source code is provided for reproducibility.
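
As a rough illustration of why a gradient norm can stand in for sharpness, consider a first-order expansion. The perturbation radius $\rho$ and the specific worst-case form below are assumptions made for this sketch; the paper's exact definition may differ. If sharpness at an input $\mathbf{x}$ is measured as the largest change in the surrogate output $g(\mathbf{x};\boldsymbol{\omega})$ under a parameter perturbation $\boldsymbol{\delta}$ with $\|\boldsymbol{\delta}\|_2 \leq \rho$, then a Taylor expansion followed by the Cauchy-Schwarz inequality gives

$$
\max_{\|\boldsymbol{\delta}\|_2 \leq \rho} \big[ g(\mathbf{x};\boldsymbol{\omega}+\boldsymbol{\delta}) - g(\mathbf{x};\boldsymbol{\omega}) \big]
\;\approx\; \max_{\|\boldsymbol{\delta}\|_2 \leq \rho} \boldsymbol{\delta}^\top \nabla_{\boldsymbol{\omega}} g(\mathbf{x};\boldsymbol{\omega})
\;=\; \rho \, \big\| \nabla_{\boldsymbol{\omega}} g(\mathbf{x};\boldsymbol{\omega}) \big\|_2 .
$$

Under this approximation, penalizing the norm of the surrogate's gradient with respect to its parameters limits how much nearby models can disagree with the trained one.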

Abstract

Offline optimization has recently emerged as an increasingly popular approach to mitigate the prohibitively expensive cost of online experimentation. The key idea is to learn a surrogate of the black-box function that underlies the target experiment using a static (offline) dataset of its previous input-output queries. Such an approach is, however, fraught with an out-of-distribution issue where the learned surrogate becomes inaccurate outside the offline data regime. To mitigate this, existing offline optimizers have proposed numerous conditioning techniques to prevent the learned surrogate from being too erratic. Nonetheless, such conditioning strategies are often specific to particular surrogate or search models, which might not generalize to a different model choice. This motivates us to develop a model-agnostic approach instead, which incorporates a notion of model sharpness into the training loss of the surrogate as a regularizer. Our approach is supported by a new theoretical analysis demonstrating that reducing surrogate sharpness on the offline dataset provably reduces its generalized sharpness on unseen data. Our analysis extends existing theories from bounding generalized prediction loss (on unseen data) with loss sharpness to bounding the worst-case generalized surrogate sharpness with its empirical estimate on training data, providing a new perspective on sharpness regularization. Our extensive experimentation on a diverse range of optimization tasks also shows that reducing surrogate sharpness often leads to significant improvement, marking (up to) a noticeable 9.6% performance boost. Our code is publicly available at https://github.com/cuong-dm/IGNITE
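
The idea of adding a sharpness-style regularizer to the surrogate's training loss can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation (see the repository above for that): the tiny tanh network, the synthetic data, the fixed penalty coefficient `lam`, and every function name are hypothetical choices made for this example, and gradients are taken by finite differences only to keep the sketch dependency-free.

```python
import numpy as np

HIDDEN = 4  # hidden width of the toy surrogate (illustrative choice)

def unpack(params, d):
    """Split a flat parameter vector into W (HIDDEN x d) and v (HIDDEN)."""
    return params[: HIDDEN * d].reshape(HIDDEN, d), params[HIDDEN * d :]

def surrogate(params, X):
    """Toy surrogate g(x; w) = v . tanh(W x), evaluated on each row of X."""
    W, v = unpack(params, X.shape[1])
    return np.tanh(X @ W.T) @ v

def grad_norm(params, X):
    """Mean over inputs of ||dg(x; w)/dw||_2, in closed form for this network."""
    W, v = unpack(params, X.shape[1])
    H = np.tanh(X @ W.T)            # dg/dv, shape (n, HIDDEN)
    S = (1.0 - H**2) * v            # dg/dW[j, k] = S[j] * x[k]
    sq = (H**2).sum(axis=1) + (S**2).sum(axis=1) * (X**2).sum(axis=1)
    return np.sqrt(sq).mean()

def objective(params, X, y, lam):
    """Prediction loss plus lam times the gradient-norm penalty."""
    mse = np.mean((surrogate(params, X) - y) ** 2)
    return mse + lam * grad_norm(params, X)

def numerical_grad(f, params, h=1e-5):
    """Central finite differences; adequate only for tiny models like this one."""
    g = np.zeros_like(params)
    for i in range(params.size):
        e = np.zeros_like(params)
        e[i] = h
        g[i] = (f(params + e) - f(params - e)) / (2 * h)
    return g

def train(params, X, y, lam, steps=200, lr=0.02):
    """Plain gradient descent on the penalized objective; returns a new vector."""
    params = params.copy()
    for _ in range(steps):
        params -= lr * numerical_grad(lambda p: objective(p, X, y, lam), params)
    return params

rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(64, 2))      # synthetic "offline" inputs
y = np.sin(2.0 * X[:, 0]) - X[:, 1] ** 2      # synthetic black-box outputs
w0 = 0.5 * rng.standard_normal(HIDDEN * 2 + HIDDEN)

w_plain = train(w0, X, y, lam=0.0)            # ordinary regression
w_flat = train(w0, X, y, lam=0.1)             # with the gradient-norm penalty
print("gradient norm, unregularized:", grad_norm(w_plain, X))
print("gradient norm, regularized:  ", grad_norm(w_flat, X))
```

The penalty here uses a fixed coefficient for simplicity. A constrained formulation would instead keep the gradient norm below a threshold and adapt the multiplier during training. In practice an automatic-differentiation framework would replace `numerical_grad`.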

Paper Structure

This paper contains 34 sections, 5 theorems, 61 equations, 8 figures, 9 tables, 2 algorithms.

Key Result

Theorem 1

There exist $\tau > 0$, $\boldsymbol{\omega}_+$, and a non-linear function $r(\mathbf{x}; \boldsymbol{\omega})$ that satisfies Assumption 2 and is bounded on $\{\boldsymbol{\omega} \mid \|\boldsymbol{\omega}\| \leq \tau\}$. The detailed derivation of this theorem is deferred to Appendix E.

Figures (8)

  • Figure 1: (a) Illustration of surrogate sharpness; (b) Illustration of surrogate sharpness-based offline optimization: Consider two surrogate parameters $\boldsymbol{\omega}_1$ and $\boldsymbol{\omega}_2$ where $\boldsymbol{\omega}_1$ has a smaller sharpness than $\boldsymbol{\omega}_2$. This means the predictions of the models in the perturbation neighborhood of $\boldsymbol{\omega}_1$ will vary less than those of the models in the perturbation neighborhood of $\boldsymbol{\omega}_2$. As such, if both neighborhoods contain the oracle, the prediction error $d_1$ of $\boldsymbol{\omega}_1$ is potentially smaller than the prediction error $d_2$ of $\boldsymbol{\omega}_2$. Consequently, the optimal value of $g(\mathbf{x};\boldsymbol{\omega}_1)$ is closer to the oracle optimal value than $g(\mathbf{x};\boldsymbol{\omega}_2)$'s.
  • Figure 1: Percentage improvement in performance achieved by IGNITE across all tasks and baseline algorithms at the 100th percentile level. Gain denotes the percentage gain over the baseline performance (Base).
  • Figure 2: Percentage improvement in performance achieved by IGNITE for different algorithms (COMs and GA) and tasks (ANT and TF10) as (a) the threshold $\epsilon$ and (b) the step size $\eta_\lambda$ vary.
  • Figure 3: Performance vs. the number of gradient ascent steps during optimization for IGNITE-2 and the baseline algorithms, e.g., COMs and GA.
  • Figure 4: Performance variation of COMs and GA (regularized by IGNITE-2) as the regularization coefficient $\lambda \in \{0.0001, 0.001, 0.01\}$ varies.
  • ...and 3 more figures

Theorems & Definitions (5)

  • Theorem 1
  • Theorem 2
  • Lemma 3
  • Lemma 4
  • Lemma 5