A Gentle Introduction to Gradient-Based Optimization and Variational Inequalities for Machine Learning

Neha S. Wadia, Yatin Dandi, Michael I. Jordan

TL;DR

This work provides a gentle introduction to a broader framework for gradient-based algorithms in machine learning, beginning with saddle points and monotone games, and proceeding to general variational inequalities.

Abstract

The rapid progress in machine learning in recent years has been based on a highly productive connection to gradient-based optimization. Further progress hinges in part on a shift in focus from pattern recognition to decision-making and multi-agent problems. In these broader settings, new mathematical challenges emerge that involve equilibria and game theory instead of optima. Gradient-based methods remain essential -- given the high dimensionality and large scale of machine-learning problems -- but simple gradient descent is no longer the point of departure for algorithm design. We provide a gentle introduction to a broader framework for gradient-based algorithms in machine learning, beginning with saddle points and monotone games, and proceeding to general variational inequalities. While we provide convergence proofs for several of the algorithms that we present, our main focus is that of providing motivation and intuition.

Paper Structure (26 sections, 11 theorems, 57 equations, 7 figures, 5 algorithms)

Key Result

Theorem 1.1

For a convex function $f$, the subgradient method converges in function value at the average iterate at a rate of $O(1/\sqrt{T})$, where $T$ is the number of iterations.
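
The rate in Theorem 1.1 can be checked numerically. The sketch below is an illustration under assumptions that are not taken from the paper: the objective $f(x)=\|x\|_1$, the starting point, the constant step size $R/(G\sqrt{T})$, and the helper names `averaged_subgradient_method` and `run_experiment` are all hypothetical choices. Here $R$ is the distance from the starting point to the minimizer and $G$ bounds the norm of the subgradients; with this step size the standard analysis gives $f(\bar{x}_T) - f^{\star} \le RG/\sqrt{T}$ for the average iterate $\bar{x}_T$.

```python
import numpy as np

def f(x):
    # Illustrative objective (an assumption): the l1 norm, minimized at 0 with f* = 0.
    return np.abs(x).sum()

def subgradient(x):
    # sign(x) is a valid subgradient of the l1 norm (0 is allowed where x_i = 0).
    return np.sign(x)

def averaged_subgradient_method(x0, step, T):
    """Run T subgradient steps with a constant step; return the average of x_0..x_{T-1}."""
    x = np.array(x0, dtype=float)
    running_sum = np.zeros_like(x)
    for _ in range(T):
        running_sum += x
        x = x - step * subgradient(x)
    return running_sum / T

def run_experiment(T, x0=(1.0, -2.0, 0.5)):
    """Return (f at the average iterate, theoretical bound R*G/sqrt(T))."""
    x0 = np.array(x0, dtype=float)
    R = np.linalg.norm(x0)      # distance from x0 to the minimizer x* = 0
    G = np.sqrt(x0.size)        # upper bound on the Euclidean norm of sign(x)
    step = R / (G * np.sqrt(T))
    x_avg = averaged_subgradient_method(x0, step, T)
    return f(x_avg), R * G / np.sqrt(T)

for T in (100, 10_000):
    value, bound = run_experiment(T)
    print(f"T={T:>6}  f(x_avg)={value:.4f}  bound={bound:.4f}")
```

Increasing $T$ by a factor of 100 shrinks the bound by a factor of 10, which is the $1/\sqrt{T}$ behavior the theorem describes; the observed value may be well below the bound.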

Figures (7)

  • Figure 1: The poverty index score was approximately Gaussian-distributed at the time when it was selected as a measure of poverty that could be thresholded for the purposes of making social policy. With time, its distribution became further and further skewed to the left of the threshold. This figure is reproduced from \cite{camacho2011manipulation}.
  • Figure 2: The notion of a fixed point is broader than that of a minimum. Here we have examples of two types of vector fields $F(x)$. In (a), the negative flow (indicated by arrow heads) of the field is toward the fixed point; this fixed point could be a minimum. In (b), the negative flow of the vector field is around the fixed point; this fixed point is not a minimum. In this lecture, we study algorithms that compute the fixed points of vector fields associated with monotone operators, which can be either of the flavor (a) or (b).
  • Figure 3: Monotonicity implies that the angle $\theta$ between the operator $-F(x)$ and the vector pointing from $x$ to $x^{\star}$ is at most $\pi/2$ for all $x$. To see this in the monotonicity condition (\ref{eq:lec3-monotonicity}), take $x=x^{\star}$, $y=x$, and without loss of generality let $F(x^{\star})=0$.
  • Figure 4: While monotonicity of $F$ guarantees that $x^{\star}$ is on the same side of the hyperplane defined by $-F$ at $x$ as the quadratic $\mu\|x-y\|^2$, strong monotonicity additionally guarantees that $x^{\star}$ is on or above the quadratic. We use the label $x^{\star}_M$ ($x^{\star}_{SM}$) to indicate a possible fixed point of $F$ when $F$ is monotone (strongly monotone). In general, $x^{\star}_M$ may be either aligned with the axis $y$ or anywhere above it, while $x^{\star}_{SM}$ is confined to be on or above $\mu\|x-y\|^2$.
  • Figure 5: Projection onto a convex set $\mathcal{X}$ is a contractive (more precisely, nonexpansive) operation: the distance between the projections of two points is at most the distance between the points themselves.
  • ...and 2 more figures
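
The projection property described in the caption of Figure 5 can be checked numerically. The following is a minimal sketch under assumptions that do not come from the paper: the convex set is taken to be a Euclidean ball, and the helper name `project_onto_ball`, the dimension, and the sampling scheme are hypothetical. It samples random pairs of points and records the largest ratio between the distance of the projections and the distance of the original points.

```python
import numpy as np

def project_onto_ball(x, radius=1.0):
    """Euclidean projection onto the ball {z : ||z|| <= radius} (an illustrative convex set)."""
    x = np.asarray(x, dtype=float)
    norm = np.linalg.norm(x)
    if norm <= radius:
        return x                    # points inside the set are left unchanged
    return x * (radius / norm)      # points outside are rescaled onto the boundary

rng = np.random.default_rng(0)
worst_ratio = 0.0
for _ in range(1000):
    # Draw two random points in R^5; most of them fall outside the unit ball.
    x, y = rng.normal(scale=3.0, size=(2, 5))
    px, py = project_onto_ball(x), project_onto_ball(y)
    ratio = np.linalg.norm(px - py) / np.linalg.norm(x - y)
    worst_ratio = max(worst_ratio, ratio)

print(f"largest ||P(x) - P(y)|| / ||x - y|| over the samples: {worst_ratio:.4f}")
```

The ratio never exceeds 1 (up to floating-point rounding), which is why a projection step can be added to a gradient-type update without increasing the distance to a fixed point that lies in the set.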

Theorems & Definitions (26)

  • Definition 1.1: Subgradient
  • Definition 1.2: Subdifferential
  • Theorem 1.1
  • Definition 2.1: Convexity
  • Definition 2.2: Smoothness
  • Lemma 2.1: Descent lemma for gradient descent on smooth convex functions
  • proof
  • Theorem 2.2
  • proof
  • Definition 2.3: Strong convexity
  • ...and 16 more