Solving Hidden Monotone Variational Inequalities with Surrogate Losses

Ryan D'Orazio, Danilo Vucetic, Zichu Liu, Junhyung Lyle Kim, Ioannis Mitliagkas, Gauthier Gidel

TL;DR

This work proposes a principled surrogate-based approach compatible with deep learning to solve VIs and shows that it guarantees convergence, provides a unifying perspective of existing methods, and is amenable to existing deep learning optimizers like ADAM.

Abstract

Deep learning has proven to be effective in a wide variety of loss minimization problems. However, many applications of interest, like minimizing projected Bellman error and min-max optimization, cannot be modelled as minimizing a scalar loss function but instead correspond to solving a variational inequality (VI) problem. This difference in setting has caused many practical challenges, as naive gradient-based approaches from supervised learning tend to diverge and cycle in the VI case. In this work, we propose a principled surrogate-based approach compatible with deep learning to solve VIs. We show that our surrogate-based approach has three main benefits: (1) under assumptions that are realistic in practice (when hidden monotone structure is present, interpolation, and sufficient optimization of the surrogates), it guarantees convergence, (2) it provides a unifying perspective of existing methods, and (3) it is amenable to existing deep learning optimizers like ADAM. Experimentally, we demonstrate that our surrogate-based approach is effective in min-max optimization and minimizing projected Bellman error. Furthermore, in the deep reinforcement learning case, we propose a novel variant of TD(0) which is more compute- and sample-efficient.
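The claim that naive gradient-based approaches diverge and cycle can be illustrated with a standard textbook example that is not taken from the paper: simultaneous gradient descent-ascent (GDA) on the bilinear game $f(x, y) = xy$. The function names, step size, and starting point below are illustrative assumptions. Each step multiplies the distance to the equilibrium $(0, 0)$ by $\sqrt{1 + \eta^2}$, so the iterates spiral outward.

```python
import math


def gda_step(x, y, eta):
    """One simultaneous gradient descent-ascent step on f(x, y) = x * y."""
    grad_x = y  # df/dx: the minimizing player descends along this
    grad_y = x  # df/dy: the maximizing player ascends along this
    return x - eta * grad_x, y + eta * grad_y


def run_gda(x0, y0, eta, steps):
    """Return the distance to the equilibrium (0, 0) after every step."""
    x, y = x0, y0
    norms = [math.hypot(x, y)]
    for _ in range(steps):
        x, y = gda_step(x, y, eta)
        norms.append(math.hypot(x, y))
    return norms


# Illustrative values (assumptions, not from the paper).
norms = run_gda(1.0, 0.0, eta=0.1, steps=100)
print(f"initial distance: {norms[0]:.4f}, final distance: {norms[-1]:.4f}")
```

The growth factor follows from $(x - \eta y)^2 + (y + \eta x)^2 = (1 + \eta^2)(x^2 + y^2)$, so no positive constant step size makes this iteration converge.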

Paper Structure

This paper contains 31 sections, 15 theorems, 83 equations, 10 figures, and 4 algorithms.

Key Result

Theorem 3.2

Let Assumption \ref{assumption:vi} hold and let $\{z_t = g(\theta_t)\}_{t\in \mathbb{N}}$ be the iterates produced by Algorithm \ref{alg:surr}. If $\alpha$ and $\eta$ are chosen such that $\rho := 1-2\eta(\mu-\alpha L) + (1+\alpha^2)\eta^2 L^2 < 1$, then the iterates $z_t$ converge linearly to the solution $z_\ast$ at the fol... In particular, if $\alpha< \frac{\mu}{L}$ and $\eta < \frac{2 (\mu-\alpha L)}{(1+\alpha^2)L^2}$ then ...
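The step-size condition in the theorem can be checked numerically. The sketch below evaluates $\rho$ and the upper bound on $\eta$ exactly as written above. The helper names and the numeric values are illustrative assumptions, and $\mu$ and $L$ are presumably the strong monotonicity and Lipschitz constants from the assumption, which this excerpt does not define.

```python
def contraction_factor(eta, alpha, mu, L):
    """rho = 1 - 2*eta*(mu - alpha*L) + (1 + alpha^2) * eta^2 * L^2."""
    return 1.0 - 2.0 * eta * (mu - alpha * L) + (1.0 + alpha ** 2) * eta ** 2 * L ** 2


def max_step_size(alpha, mu, L):
    """Upper bound on eta from the theorem; requires alpha < mu / L."""
    if alpha >= mu / L:
        raise ValueError("the theorem's condition requires alpha < mu / L")
    return 2.0 * (mu - alpha * L) / ((1.0 + alpha ** 2) * L ** 2)


# Illustrative constants (assumptions, not values from the paper).
mu, L, alpha = 1.0, 2.0, 0.25

eta_max = max_step_size(alpha, mu, L)
rho_half = contraction_factor(0.5 * eta_max, alpha, mu, L)
rho_over = contraction_factor(1.1 * eta_max, alpha, mu, L)

print(f"eta_max = {eta_max:.6f}")
print(f"rho at eta_max / 2   = {rho_half:.6f}")
print(f"rho at 1.1 * eta_max = {rho_over:.6f}")
```

Because $\rho$ is a quadratic in $\eta$ that equals 1 at $\eta = 0$ and at the stated bound, it is below 1 strictly between the two and is minimized at half the bound, where $\rho = 1 - \frac{(\mu - \alpha L)^2}{(1+\alpha^2)L^2}$.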

Figures (10)

  • Figure 1: Convergence of various algorithms from Section \ref{sec:non-linear} on the hidden matching pennies game. PHGD and GDA as presented in \cite{sakos2024exploiting} are compared against GN, DGN, LM, and GD. (left) Linear convergence to the equilibrium is observed for several methods, with LM and GD outperforming the rest. (middle) Trajectories for some methods are plotted in both the parameter and prediction space. (right) The loss ratio $\ell_t(\theta_{t+1})/\ell_t(\theta_t)$ is illustrated for the considered methods.
  • Figure 2: Convergence in the hidden RPS game.
  • Figure 3: The average approximation error between GD on the surrogate (\ref{eq:surr-stoch-berts}, Surr-GD) and update \ref{eq:bert-update-stoch} over 10,000 runs in a slow-mixing 100-state Markov chain from \cite{bertsekas2009projected} and \cite{Hu_Berts_markov_chains_ex}. Surr-GD is observed to converge to the exact update \ref{eq:bert-update-stoch}, with faster convergence for more inner steps.
  • Figure 4: Comparison of average performance of TD(0) and surrogate methods in minimizing the value prediction error for RL tasks with nonlinear function approximation in the Ant (top) and HalfCheetah (bottom) environments, measured by outer loop iterations (left) and wall-clock time (right). The average value prediction error across 20 runs, along with 95% confidence intervals, is computed from a fixed test set. The red dashed line represents the lowest value prediction error achieved by any of the algorithms.
  • Figure 5: Time in seconds for performing 10,000 iterations of each method. The numbers in parentheses correspond to the number of inner steps taken. As a special case, PHGD and GDA are equivalent to GN(1) and Surr-GD(1), respectively.
  • ...and 5 more figures

Theorems & Definitions (30)

  • Definition 2.1: $\alpha$-descent
  • Theorem 3.2
  • Proposition 3.3
  • Definition 3.5: $\alpha$-expected descent
  • Theorem 3.6
  • Proposition 4.1
  • Proposition 4.2
  • Remark A.1
  • proof
  • Lemma A.2
  • ...and 20 more