Learning Theory for Kernel Bilevel Optimization

Fares El Khoury, Edouard Pauwels, Samuel Vaiter, Michael Arbel

TL;DR

This work establishes the first finite-sample generalization bounds for Kernel Bilevel Optimization (KBO), where the inner problem is solved in a reproducing kernel Hilbert space and the outer objective is an expectation of a pointwise loss. It derives a gradient representation via functional implicit differentiation in RKHS, introduces practical plug-in estimators for the value function and its gradient, and proves uniform generalization bounds using empirical process theory and degenerate U-processes, yielding rates of order $O\left(\frac{1}{\sqrt{m}}+\frac{1}{\sqrt{n}}\right)$. The authors also show the equivalence of two gradient estimators and provide convergence guarantees for bilevel gradient methods, supported by numerical experiments on synthetic instrumental variable regression. These results inform sample-computation trade-offs in nonparametric bilevel learning and guide kernel-based hyperparameter tuning and related tasks under distribution shift.

Abstract

Bilevel optimization has emerged as a technique for addressing a wide range of machine learning problems that involve an outer objective implicitly determined by the minimizer of an inner problem. While prior works have primarily focused on the parametric setting, a learning-theoretic foundation for bilevel optimization in the nonparametric case remains relatively unexplored. In this paper, we take a first step toward bridging this gap by studying Kernel Bilevel Optimization (KBO), where the inner objective is optimized over a reproducing kernel Hilbert space. This setting enables rich function approximation while providing a foundation for rigorous theoretical analysis. In this context, we derive novel finite-sample generalization bounds for KBO, leveraging tools from empirical process theory. These bounds further allow us to assess the statistical accuracy of gradient-based methods applied to the empirical discretization of KBO. We numerically illustrate our theoretical findings on a synthetic instrumental variable regression task.
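To make the setting concrete, here is a minimal sketch of an empirical kernel bilevel problem and its gradient, checked against finite differences. It is an illustration under stated assumptions, not the paper's estimator or its instrumental variable experiment: the Gaussian kernel, the inner loss (kernel ridge regression onto hypothetical targets `omega[0] + omega[1] * y`), the squared outer loss, and all function names are choices made here for illustration.

```python
import numpy as np

def gaussian_kernel(A, B, bandwidth=1.0):
    # Pairwise Gaussian kernel matrix between the rows of A and the rows of B.
    sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2.0 * A @ B.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * bandwidth**2))

def inner_solution(omega, K_in, y_in, lam):
    # Hypothetical inner problem: kernel ridge regression onto the
    # omega-dependent targets t_i = omega[0] + omega[1] * y_i.
    # By the representer theorem, h = sum_i alpha_i k(x_i, .), with
    # alpha = (K + n * lam * I)^{-1} t.
    n = K_in.shape[0]
    targets = omega[0] + omega[1] * y_in
    return np.linalg.solve(K_in + n * lam * np.eye(n), targets)

def outer_value(omega, K_in, K_out, y_in, y_out, lam):
    # Empirical outer objective: mean squared error of the inner solution
    # evaluated on the m outer samples.
    alpha = inner_solution(omega, K_in, y_in, lam)
    return np.mean((K_out @ alpha - y_out) ** 2)

def outer_gradient(omega, K_in, K_out, y_in, y_out, lam):
    # Gradient of the empirical outer objective via one adjoint linear solve.
    n, m = K_in.shape[0], K_out.shape[0]
    alpha = inner_solution(omega, K_in, y_in, lam)
    g_alpha = (2.0 / m) * K_out.T @ (K_out @ alpha - y_out)
    # Adjoint system: the inner matrix is symmetric, so no transpose is needed.
    b = np.linalg.solve(K_in + n * lam * np.eye(n), g_alpha)
    # Chain rule through the targets: dt/d omega[0] = 1, dt/d omega[1] = y_in.
    return np.array([np.sum(b), y_in @ b])

# Synthetic data (assumed for illustration): n inner and m outer samples.
rng = np.random.default_rng(0)
n, m, lam = 40, 30, 0.1
x_in = rng.normal(size=(n, 1))
y_in = np.sin(x_in[:, 0]) + 0.1 * rng.normal(size=n)
x_out = rng.normal(size=(m, 1))
y_out = 2.0 * np.sin(x_out[:, 0]) + 0.5 + 0.1 * rng.normal(size=m)
K_in = gaussian_kernel(x_in, x_in)
K_out = gaussian_kernel(x_out, x_in)

omega = np.array([0.0, 1.0])
grad = outer_gradient(omega, K_in, K_out, y_in, y_out, lam)

# Central finite differences as an independent check of the adjoint gradient.
eps = 1e-6
fd = np.array([
    (outer_value(omega + eps * e, K_in, K_out, y_in, y_out, lam)
     - outer_value(omega - eps * e, K_in, K_out, y_in, y_out, lam)) / (2 * eps)
    for e in np.eye(2)
])
print("adjoint gradient:", grad)
print("finite differences:", fd)
```

The gradient costs one extra linear solve with the same matrix as the inner problem, so the cost per outer gradient step is dominated by the $n\times n$ solves. This is one place where the trade-off between the sample sizes $n$, $m$ and computation shows up in practice.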

Paper Structure

This paper contains 36 sections, 30 theorems, 187 equations, 3 figures.

Key Result

Proposition 2.2

Under Assumptions assump:K_meas, assump:compact, assump:convexity_lin, assump:K_bounded, and assump:reg_lin_lout, $\mathcal{F}$ is differentiable on $\mathbb{R}^d$, and for any $\omega\in \mathbb{R}^d$ its gradient $\nabla\mathcal{F}(\omega)$ is expressed in terms of the adjoint function $a^\star_\omega\in\mathcal{H}$, which is the unique minimizer of a strongly convex quadratic objective $a\mapsto L_{adj}(\omega,a)$ defined on $\mathcal
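For orientation, the classical finite-dimensional analogue of an adjoint-based gradient can be written out explicitly. The display below is a sketch in the parametric setting only, assuming $h\in\mathbb{R}^p$ and an inner objective $L_{in}$ that is twice differentiable and strongly convex in $h$; it is not the RKHS statement of the proposition. Let $h^\star_\omega=\arg\min_{h\in\mathbb{R}^p}L_{in}(\omega,h)$ and $\mathcal{F}(\omega)=L_{out}(\omega,h^\star_\omega)$. Differentiating the optimality condition $\partial_h L_{in}(\omega,h^\star_\omega)=0$ with respect to $\omega$ and applying the chain rule gives:

$$a^\star_\omega=\arg\min_{a\in\mathbb{R}^p}\ \tfrac{1}{2}\,a^\top\partial_h^2L_{in}(\omega,h^\star_\omega)\,a+a^\top\partial_hL_{out}(\omega,h^\star_\omega)=-\left[\partial_h^2L_{in}(\omega,h^\star_\omega)\right]^{-1}\partial_hL_{out}(\omega,h^\star_\omega),$$

$$\nabla\mathcal{F}(\omega)=\partial_\omega L_{out}(\omega,h^\star_\omega)+\partial^2_{\omega,h}L_{in}(\omega,h^\star_\omega)\,a^\star_\omega,$$

where $\partial^2_{\omega,h}L_{in}\in\mathbb{R}^{d\times p}$ denotes the matrix of mixed second derivatives with entries $\partial^2L_{in}/\partial\omega_i\,\partial h_j$. Computing $a^\star_\omega$ therefore amounts to one linear solve with the inner Hessian, which avoids forming the Jacobian of $\omega\mapsto h^\star_\omega$.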

Figures (3)

  • Figure 1: A commutative diagram illustrating that plug-in statistical estimation and differentiation can be interchanged for $\mathcal{F}$ and $\widehat{\mathcal{F}}$ resulting in a single gradient estimator.
  • Figure 2: Illustration of gradient descent on \ref{eq:kbo_app} for the instrumental variable regression task using synthetic data. The plots are averaged over 50 runs and displayed on a log-log scale. The line represents the mean across all runs, and the shaded region indicates the 95% confidence interval.
  • Figure 3: Illustration of gradient descent on \ref{eq:kbo_app} for the instrumental variable regression task using synthetic data, with an instrumental variable sampled from a standard Gaussian distribution. The logs of the means of the four quantities across 50 runs are displayed.

Theorems & Definitions (57)

  • Remark 2.1
  • Proposition 2.2: Expression of the total gradient
  • Proposition 3.1
  • Theorem 4.1: Maximal inequalities
  • Corollary 4.2: Generalization for bilevel gradient descent
  • Proposition B.1: Differentiability of $L_{in}$ and $L_{out}$
  • Proposition B.2: Differentiability of $\partial_h L_{in}$
  • Proposition B.3: Strong convexity of the inner objective in its second variable and invertibility of the Hessians
  • proof
  • Proposition B.4: Total functional gradient $\nabla\mathcal{F}$
  • ...and 47 more