A Gentle Introduction to Gradient-Based Optimization and Variational Inequalities for Machine Learning

Neha S. Wadia; Yatin Dandi; Michael I. Jordan

TL;DR

This work provides a gentle introduction to a broader framework for gradient-based algorithms in machine learning, beginning with saddle points and monotone games, and proceeding to general variational inequalities.

Abstract

The rapid progress in machine learning in recent years has been based on a highly productive connection to gradient-based optimization. Further progress hinges in part on a shift in focus from pattern recognition to decision-making and multi-agent problems. In these broader settings, new mathematical challenges emerge that involve equilibria and game theory instead of optima. Gradient-based methods remain essential -- given the high dimensionality and large scale of machine-learning problems -- but simple gradient descent is no longer the point of departure for algorithm design. We provide a gentle introduction to a broader framework for gradient-based algorithms in machine learning, beginning with saddle points and monotone games, and proceeding to general variational inequalities. While we provide convergence proofs for several of the algorithms that we present, our main focus is that of providing motivation and intuition.

Paper Structure (26 sections, 11 theorems, 57 equations, 7 figures, 5 algorithms)

Introduction
The Challenges of Decision-Making Processes
Multi-Way Markets
Challenges at the Intersection of Machine Learning and Economics
Two Illustrative Examples
Strategic Classification
Distribution-Free Uncertainty Quantification for Decision-Making
Overview of the Lectures
The Subgradient Method and a First Convergence Proof
Computing Optima in Discrete and Continuous Time
Convergence Guarantees for Gradient Descent on Convex Functions
Gradient Descent on Nonconvex Functions: Escaping Saddle Points Efficiently
Variational, Hamiltonian, and Symplectic Perspectives on Acceleration
An open problem.
Variational Inequalities: From Minima to Nash Equilibria and Fixed Points
...and 11 more sections

Key Result

Theorem 1.1

For a convex function $f$, the average iterate of the subgradient method converges in function value at a rate of $O(1/\sqrt{T})$, where $T$ is the number of iterations.
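The statement can be checked numerically on a small example. The sketch below is illustrative only and is not taken from the paper: the test function $f(x)=\|x-c\|_1$, the constant step size $\eta = R/(G\sqrt{T})$, and the names `subgradient_method`, `R`, and `G` are all assumptions. With this step size, the textbook analysis gives the bound $f(\bar{x}) - f^{\star} \le RG/\sqrt{T}$ for the average iterate $\bar{x}$, where $R$ is the initial distance to the minimizer and $G$ bounds the subgradient norm.

```python
import math

def f(x, c):
    # Nonsmooth convex test function (an assumption): l1 distance to c, minimum value 0 at x = c.
    return sum(abs(xi - ci) for xi, ci in zip(x, c))

def subgrad(x, c):
    # Coordinate-wise sign is a valid subgradient of the l1 distance (0 is allowed at a kink).
    return [(xi > ci) - (xi < ci) for xi, ci in zip(x, c)]

def subgradient_method(x0, c, T, eta):
    # Runs T subgradient steps and returns the average of the iterates x_0, ..., x_{T-1}.
    x = list(x0)
    avg = [0.0] * len(x0)
    for _ in range(T):
        avg = [a + xi / T for a, xi in zip(avg, x)]
        g = subgrad(x, c)
        x = [xi - eta * gi for xi, gi in zip(x, g)]
    return avg

c = [1.0, -2.0]            # hypothetical minimizer
x0 = [0.0, 0.0]            # starting point
R = math.dist(x0, c)       # initial distance to the minimizer
G = math.sqrt(len(c))      # bound on the Euclidean norm of any sign subgradient

gaps, bounds = {}, {}
for T in (100, 10_000):
    eta = R / (G * math.sqrt(T))          # constant step size tuned to the horizon T
    x_bar = subgradient_method(x0, c, T, eta)
    gaps[T] = f(x_bar, c)                 # suboptimality gap, since f* = 0
    bounds[T] = R * G / math.sqrt(T)      # bound from the standard analysis
    print(T, gaps[T], bounds[T])
```

Note that the guarantee concerns the average iterate; with a constant step size the last iterate of the subgradient method need not settle at the minimizer.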

Figures (7)

Figure 1: The poverty index score was approximately Gaussian-distributed at the time when it was selected as a measure of poverty that could be thresholded for the purposes of making social policy. With time, its distribution became further and further skewed to the left of the threshold. This figure is reproduced from \cite{camacho2011manipulation}.
Figure 2: The notion of a fixed point is broader than that of a minimum. Here we have examples of two types of vector fields $F(x)$. In (a), the negative flow (indicated by arrow heads) of the field is toward the fixed point; this fixed point could be a minimum. In (b), the negative flow of the vector field is around the fixed point; this fixed point is not a minimum. In this lecture, we study algorithms that compute the fixed points of vector fields associated with monotone operators, which can be either of the flavor (a) or (b).
Figure 3: Monotonicity implies that the angle $\theta$ between the operator $-F(x)$ and the vector pointing from $x$ to $x^{\star}$ is at most $\pi/2$ for all $x$. To see this from the monotonicity condition \ref{eq:lec3-monotonicity}, take $x=x^{\star}$, $y=x$, and without loss of generality let $F(x^{\star})=0$.
Figure 4: While monotonicity of $F$ guarantees that $x^{\star}$ is on the same side of the hyperplane defined by $-F$ at $x$ as the quadratic $\mu\|x-y\|^2$, strong monotonicity additionally guarantees that $x^{\star}$ is on or above the quadratic. We use the label $x^{\star}_M$ ($x^{\star}_{SM}$) to indicate a possible fixed point of $F$ when $F$ is monotone (strongly monotone). In general, $x^{\star}_M$ may be either aligned with the axis $y$ or anywhere above it, while $x^{\star}_{SM}$ is confined to be on or above $\mu\|x-y\|^2$.
Figure 5: Projection onto a convex set $\mathcal{X}$ is a contractive (more precisely, nonexpansive) operation, which means that the distance between the projections of two points is at most the distance between the points themselves.
...and 2 more figures
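The properties described in the captions of Figures 2, 3, and 5 can be probed numerically. The following is a minimal sketch under stated assumptions rather than anything taken from the paper: the two example fields (the gradient of $\tfrac{1}{2}\|z\|^2$ for flavor (a), and the rotational field obtained from the bilinear function $f(x,y)=xy$ for flavor (b)), the unit ball as the convex set, and all function names are hypothetical choices. It evaluates the monotonicity inner product $\langle F(u)-F(v),\, u-v\rangle$ on random pairs and checks that projection does not increase distances.

```python
import math
import random

def F_grad(z):
    # Flavor (a): gradient field of 0.5 * ||z||^2; the negative flow points at the origin.
    return [z[0], z[1]]

def F_rot(z):
    # Flavor (b): field (df/dx, -df/dy) for f(x, y) = x * y; the flow circles the origin.
    return [z[1], -z[0]]

def inner(u, v):
    return sum(a * b for a, b in zip(u, v))

def sub(u, v):
    return [a - b for a, b in zip(u, v)]

def monotonicity_gap(F, u, v):
    # <F(u) - F(v), u - v>; this is nonnegative for every pair when F is monotone.
    return inner(sub(F(u), F(v)), sub(u, v))

def project_ball(z, radius=1.0):
    # Euclidean projection onto the ball of the given radius centered at the origin.
    n = math.hypot(*z)
    return list(z) if n <= radius else [radius * zi / n for zi in z]

rng = random.Random(0)  # fixed seed so the check is reproducible
pairs = [([rng.uniform(-3, 3) for _ in range(2)],
          [rng.uniform(-3, 3) for _ in range(2)]) for _ in range(1000)]

min_gap_grad = min(monotonicity_gap(F_grad, u, v) for u, v in pairs)
max_abs_gap_rot = max(abs(monotonicity_gap(F_rot, u, v)) for u, v in pairs)
nonexpansive = all(
    math.dist(project_ball(u), project_ball(v)) <= math.dist(u, v) + 1e-12
    for u, v in pairs
)
print(min_gap_grad, max_abs_gap_rot, nonexpansive)
```

Both fields are monotone and share the origin as a fixed point, but the rotational field has a monotonicity gap of zero everywhere, so it is monotone without being strongly monotone and its fixed point is not a minimum.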

Theorems & Definitions (26)

Definition 1.1: Subgradient
Definition 1.2: Subdifferential
Theorem 1.1
Definition 2.1: Convexity
Definition 2.2: Smoothness
Lemma 2.1: Descent lemma for gradient descent on smooth convex functions
proof
Theorem 2.2
proof
Definition 2.3: Strong convexity
...and 16 more
