Robust Gaussian Processes via Relevance Pursuit

Sebastian Ament, Elizabeth Santorella, David Eriksson, Ben Letham, Maximilian Balandat, Eytan Bakshy

TL;DR

This work addresses robustness of Gaussian Processes to sparse label corruptions by introducing Robust Gaussian Processes via Relevance Pursuit (RRP), which learns data-point-specific noise variances ρ and uses a greedy relevance-pursuit strategy to identify a sparse set of outliers. A key theoretical contribution is a convex reparameterization ρ(s) that yields strong convexity and smoothness of the negative marginal log-likelihood, enabling approximation guarantees for the subset-selection process through generalized orthogonal matching pursuit. The framework supports automatic outlier detection via Bayesian model selection and remains compatible with arbitrary kernels and mean functions, providing competitive performance in both regression and Bayesian optimization under sparse corruptions. Empirically, RRP demonstrates robustness to diverse corruption regimes (constant, uniform, asymmetric, focused) and offers favorable computation times relative to heavy-tailed alternatives, while delivering principled uncertainty estimates. Overall, the method advances robust GP learning with theoretical guarantees and practical applicability to BO and related tasks where data integrity is imperfect.

Abstract

Gaussian processes (GPs) are non-parametric probabilistic regression models that are popular due to their flexibility, data efficiency, and well-calibrated uncertainty estimates. However, standard GP models assume homoskedastic Gaussian noise, while many real-world applications are subject to non-Gaussian corruptions. Variants of GPs that are more robust to alternative noise models have been proposed, and entail significant trade-offs between accuracy and robustness, and between computational requirements and theoretical guarantees. In this work, we propose and study a GP model that achieves robustness against sparse outliers by inferring data-point-specific noise levels with a sequential selection procedure maximizing the log marginal likelihood that we refer to as relevance pursuit. We show, surprisingly, that the model can be parameterized such that the associated log marginal likelihood is strongly concave in the data-point-specific noise variances, a property rarely found in either robust regression objectives or GP marginal likelihoods. This in turn implies the weak submodularity of the corresponding subset selection problem, and thereby proves approximation guarantees for the proposed algorithm. We compare the model's performance relative to other approaches on diverse regression and Bayesian optimization tasks, including the challenging but common setting of sparse corruptions of the labels within or close to the function range.
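
The selection procedure described in the abstract can be illustrated with a small NumPy sketch. It is not the authors' implementation: it assumes a zero-mean GP with a fixed RBF kernel, fixed hyper-parameters, and a coarse grid search over each candidate variance, whereas the paper optimizes the variances through a reparameterization. The names `rbf_kernel`, `log_marginal_likelihood`, and `greedy_noise_selection` are hypothetical. The marginal likelihood used is $\log \mathcal{N}(\mathbf{y} \mid \mathbf{0}, K + \sigma^2 I + \mathrm{diag}({\boldsymbol \rho}))$, and each step adds the single data point whose extra noise variance increases it the most.

```python
import numpy as np


def rbf_kernel(x, lengthscale=0.2):
    """Squared-exponential kernel matrix for 1-d inputs."""
    diff = x[:, None] - x[None, :]
    return np.exp(-0.5 * (diff / lengthscale) ** 2)


def log_marginal_likelihood(K, y, noise, rho):
    """log N(y | 0, K + noise * I + diag(rho)) via a Cholesky factorization."""
    n = len(y)
    L = np.linalg.cholesky(K + noise * np.eye(n) + np.diag(rho))
    z = np.linalg.solve(L, y)  # z = L^{-1} y, so z @ z = y^T C^{-1} y
    return -0.5 * z @ z - np.log(np.diag(L)).sum() - 0.5 * n * np.log(2 * np.pi)


def greedy_noise_selection(K, y, noise, max_outliers, grid=None):
    """Greedily give individual points their own extra noise variance.

    Simplification (an assumption of this sketch, not the paper's optimizer):
    each candidate variance is picked from a fixed grid, previously selected
    variances are not re-optimized, and hyper-parameters are held fixed.
    """
    grid = np.logspace(-2, 2, 9) if grid is None else grid
    rho = np.zeros(len(y))
    support = []
    trace = [log_marginal_likelihood(K, y, noise, rho)]
    for _ in range(max_outliers):
        best = (trace[-1], None, 0.0)  # (objective, index, variance)
        for i in range(len(y)):
            if i in support:
                continue
            for value in grid:
                trial = rho.copy()
                trial[i] = value
                objective = log_marginal_likelihood(K, y, noise, trial)
                if objective > best[0]:
                    best = (objective, i, value)
        if best[1] is None:  # no candidate improves the objective
            break
        rho[best[1]] = best[2]
        support.append(best[1])
        trace.append(best[0])
    return support, rho, trace


# Toy data (hypothetical): a sine curve with two labels shifted by +/- 3.
x = np.linspace(0.0, 1.0, 30)
y = np.sin(2 * np.pi * x)
y[7] += 3.0
y[20] -= 3.0
support, rho, trace = greedy_noise_selection(
    rbf_kernel(x), y, noise=1e-2, max_outliers=2
)
print(sorted(support), np.round(trace, 1))
```

On this toy problem the two shifted labels are far outside what their neighbors predict, so they should be the points that receive a non-zero variance. The returned `trace` holds the log marginal likelihood after each step, which is the quantity a model-selection step over the support size would compare.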

Paper Structure

This paper contains 52 sections, 15 theorems, 58 equations, 10 figures, 7 tables, 2 algorithms.

Key Result

Lemma 0

[Optimal Robust Variances] Let $\mathcal{D}_{\backslash i} = \{({\mathbf x}_j, y_j): j \neq i\}$, ${\boldsymbol \rho} = {\boldsymbol \rho}_{\backslash i} + \rho_i \mathbf e_i$, where ${\boldsymbol \rho}, {\boldsymbol \rho}_{\backslash i} \in \mathbb{R}_+^n$, $[{\boldsymbol \rho}_{\backslash i}]_i =$ ... where $y({\mathbf x}_i) = f({\mathbf x}_i) + \epsilon_i$. These quantities can be expressed as func...

Figures (10)

  • Figure 1: Comparison of RRP to a standard GP and a variational GP with a Student-$t$ likelihood on a regression example. While the other models are led astray by the corrupted observations, RRP successfully identifies the corruptions (red) and thus achieves a much better fit to the ground truth.
  • Figure 2: Left: Evolution of model posterior during Relevance Pursuit, as the number of data-point-specific variances $|S|$ increases (from light colors to dark). Red points indicate corruptions that were generated by uniformly sampling from the function's range. Right: Comparison of posterior marginal likelihoods as a function of a model's $|S|$. The maximizer -- boxed in black -- is the preferred model.
  • Figure 3: Top: The behavior of $-\log\mathcal{L}({\boldsymbol \rho})$ with respect to the canonical parameterization of ${\boldsymbol \rho}$. Bottom: The behavior of $-\log\mathcal{L}({\boldsymbol \rho}({\mathbf s}))$, highlighting the convexity property. Left: The value and first two derivatives of $-\log\mathcal{L}$ for a 1d example. Center: The second derivatives of a 1d $-\log\mathcal{L}$ as a function of $|y|$. The ${\mathbf s}$-parameterization is everywhere convex for all considered $|y|$, while the canonical ${\boldsymbol \rho}$-parameterization is only convex around the origin and only for $|y| > 0.5$. Right: The heatmaps highlight that the original parameterization is non-convex (red) for larger values of $\rho$, and quickly becomes ill-conditioned, whereas the parameterization ${\boldsymbol \rho}({\mathbf s})$ is convex and much better conditioned.
  • Figure 4: Left: Distribution of predictive test-set log likelihood for various methods. Methods omitted are those that performed substantially worse. Right: Predictive log likelihood as a function of the corruption probability for Student-$t$-distributed corruptions with two degrees of freedom. The GP model with the Student-$t$ likelihood only starts outperforming RRP as the corruption probability increases beyond 40%, and exhibits a large variance in outcomes, which shrinks as the proportion of corruptions increases. All methods not shown were inferior to either RRP or Student-$t$.
  • Figure 5: Results on the intra-day data from the Dow Jones Industrial Average (DJIA) index on April 22-23, 2013, which includes a sharp drop at 13:10 on the 23rd; see (b) for a detailed view. The accompanying panels labeled $w_{\text{imq}}$ show the function that the RCGP of Altamirano et al. (2023) uses to down-weight data points. Top: RCGP exhibits higher robustness than the standard GP, but is still affected by the outliers. The RRP model is virtually unaffected. Bottom: Including the previous trading day in the training data in (c) leads RCGP to assign the highest weight $w_{\text{imq}}$ to the outlying data points due to their proximity to the target values' median, so that RCGP is even more affected than a standard GP; see (d) for a detailed view of the results on the data of April 23.
  • ...and 5 more figures
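
The model-selection step described in the caption of Figure 2 (right), where the preferred model is the maximizer of the posterior marginal likelihood over $|S|$, can be sketched as follows. This is a minimal illustration under stated assumptions: the exponential prior on the support size and its `mean_outliers` parameter are choices made for this sketch, the function name `select_support_size` is hypothetical, and the input values are made up for illustration rather than taken from the paper.

```python
import numpy as np


def select_support_size(log_marginal_likelihoods, mean_outliers=2.0):
    """Return the support size |S| with the highest unnormalized log posterior.

    Assumption of this sketch: a prior p(|S| = k) proportional to
    exp(-k / mean_outliers). Entry k of the input is the log marginal
    likelihood of the model with k data-point-specific variances.
    """
    lml = np.asarray(log_marginal_likelihoods, dtype=float)
    sizes = np.arange(len(lml))
    log_posterior = lml - sizes / mean_outliers  # log likelihood + log prior
    return int(np.argmax(log_posterior)), log_posterior


# Made-up values for |S| = 0, 1, 2, 3, 4: large gains, then diminishing returns.
best_size, scores = select_support_size([-400.0, -200.0, -10.0, -9.8, -9.7])
print(best_size)
```

The prior penalizes each additional variance, so the small likelihood gains beyond the second variance do not justify a larger support in this made-up example.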

Theorems & Definitions (27)

  • Lemma 0
  • Definition 1: Restricted Isometry Property
  • Definition 2: Restricted Strong Convexity and Smoothness
  • Lemma 2
  • Lemma 2
  • Definition 3: Diagonal Dominance
  • Lemma 3
  • Theorem 4
  • Definition 5: Submodularity Ratios (Elenberg et al., 2018)
  • Theorem 6: Weak Submodularity via RSC (Elenberg et al., 2018)
  • ...and 17 more