On Differentiating Parameterized Argmin and Argmax Problems with Application to Bi-level Optimization

Stephen Gould, Basura Fernando, Anoop Cherian, Peter Anderson, Rodrigo Santa Cruz, Edison Guo

TL;DR

The paper derives exact gradient expressions for differentiating parameterized argmin and argmax problems in bi-level optimization, covering unconstrained and constrained lower-level problems. It presents general implicit-differentiation formulas, extends them to equality and inequality constraints via null-space and barrier methods, and demonstrates applications with scalar and softmax exemplars. A bi-level learning example shows how to adjust model parameters to steer the location of maximum-likelihood features, highlighting practical end-to-end learning potential. The discussion addresses computational considerations and suggests directions for scalable and non-smooth settings in real-world AI tasks.

Abstract

Some recent works in machine learning and computer vision involve the solution of a bi-level optimization problem. Here the solution of a parameterized lower-level problem binds variables that appear in the objective of an upper-level problem. The lower-level problem typically appears as an argmin or argmax optimization problem. Many techniques have been proposed to solve bi-level optimization problems, including gradient descent, which is popular with current end-to-end learning approaches. In this technical report we collect some results on differentiating argmin and argmax optimization problems with and without constraints and provide some insightful motivating examples.

Paper Structure

This paper contains 14 sections, 7 theorems, 50 equations, 6 figures.

Key Result

Lemma 3.1

Let $f: \mathbb{R} \times \mathbb{R} \rightarrow \mathbb{R}$ be a continuous function with first and second derivatives. Let $g(x) = \mathop{\textrm{argmin}}_{y} f(x, y)$. Then the derivative of $g$ with respect to $x$ is

$$\frac{dg(x)}{dx} = -\frac{f_{XY}(x, g(x))}{f_{YY}(x, g(x))}$$

where $f_{XY} \doteq \frac{\partial^2f}{\partial x \partial y}$ and $f_{YY} \doteq \frac{\partial^2f}{\partial y^2}$.
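As a rough numerical check of Lemma 3.1, the sketch below evaluates $g'(x) = -f_{XY}/f_{YY}$ for the scalar function from Figure 1, $f(x, y) = xy^4 + 2x^2y^3 - 12y^2$, and compares it with a central finite difference of the minimizer. The helper names (`local_argmin`, `lemma_gradient`), the Newton solver, the starting point `y0=2.0`, and the step size `h` are illustrative assumptions, not part of the paper.

```python
import math

# Partial derivatives of f(x, y) = x*y**4 + 2*x**2*y**3 - 12*y**2,
# worked out by hand for this sketch.
def f_y(x, y):
    return 4 * x * y**3 + 6 * x**2 * y**2 - 24 * y

def f_yy(x, y):
    return 12 * x * y**2 + 12 * x**2 * y - 24

def f_xy(x, y):
    return 4 * y**3 + 12 * x * y**2

def local_argmin(x, y0, iters=50):
    """Newton iterations on f_y(x, y) = 0 (hypothetical helper).

    Converges to the stationary point nearest y0; the caller should
    check f_yy > 0 to confirm it is a local minimum.
    """
    y = y0
    for _ in range(iters):
        y -= f_y(x, y) / f_yy(x, y)
    return y

def lemma_gradient(x, y_star):
    """g'(x) = -f_XY / f_YY evaluated at the stationary point y_star."""
    return -f_xy(x, y_star) / f_yy(x, y_star)

x = 1.0
y_star = local_argmin(x, y0=2.0)      # positive local minimum at x = 1
analytic = lemma_gradient(x, y_star)

# Central finite difference of the minimizer, for comparison only.
h = 1e-5
numeric = (local_argmin(x + h, y_star) - local_argmin(x - h, y_star)) / (2 * h)

print("y* =", y_star)
print("g'(x) via Lemma 3.1       =", analytic)
print("g'(x) via finite difference =", numeric)
```

Starting Newton's method from a different `y0` would track one of the other stationary points shown in Figure 1, and the same formula applies there as long as $f_{YY} \neq 0$.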

Figures (6)

  • Figure 1: Example of a parameterized scalar function $f(x, y) = xy^4 + 2x^2y^3 - 12y^2$ with three stationary points for any fixed $x > 0$. The top-left panel shows a contour plot of $f$; the bottom-left panel shows the function at $x = 1$; and the remaining panels show the three solutions for $g(x) = \mathop{\textrm{argmin}}_y f(x, y)$ and corresponding gradients $g'(x)$ at each stationary point.
  • Figure 2: Example maximum-likelihood surfaces $\ell_i({\boldsymbol{x}})$ before and after taking a small step on all parameters in the negative gradient direction for $x^\star_1$.
  • Figure 3: Example graph of both function value and gradient for $g(x) = \mathop{\textrm{argmin}}_{y \geq 0} (x - y)^2$ and approximations $g_t(x) = \mathop{\textrm{argmin}}_{y} t (x - y)^2 - \log(y)$ for different values of $t$. As $t \rightarrow \infty$ the approximation converges to the actual function and gradient.
  • Figure 4: Example maximum-likelihood surfaces $\ell_i({\boldsymbol{x}})$ before and after taking a small step on all parameters in the negative gradient direction for $x^\star_1$ with constraint $\textbf{1}^T {\boldsymbol{x}} = 1$.
  • Figure 5: Example maximum-likelihood surfaces $\ell_i({\boldsymbol{x}})$ before and after taking a small step on all parameters in the negative gradient direction for $x^\star_1$ with constraint $\|{\boldsymbol{x}}\|_2 \leq 1$.
  • ...and 1 more figure
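The log-barrier approximation in the Figure 3 caption can be explored with a small sketch. Setting the derivative of $t(x - y)^2 - \log(y)$ with respect to $y$ to zero gives the quadratic $2ty^2 - 2txy - 1 = 0$, whose positive root is $g_t(x) = x/2 + \sqrt{x^2/4 + 1/(2t)}$. This closed form is derived here for illustration rather than quoted from the paper, and the function names are hypothetical. The gradient uses the scalar implicit-differentiation formula $g_t'(x) = -f_{XY}/f_{YY}$.

```python
import math

def g_barrier(x, t):
    """Minimizer over y > 0 of t*(x - y)**2 - log(y).

    Positive root of 2*t*y**2 - 2*t*x*y - 1 = 0 (derived for this sketch).
    """
    return 0.5 * x + math.sqrt(0.25 * x * x + 1.0 / (2.0 * t))

def g_barrier_grad(x, t):
    """Gradient of g_barrier via g'(x) = -f_XY / f_YY at the minimizer."""
    y = g_barrier(x, t)
    f_xy = -2.0 * t                   # d/dx of the y-derivative
    f_yy = 2.0 * t + 1.0 / (y * y)    # always positive for y > 0
    return -f_xy / f_yy

def g_exact(x):
    """Constrained solution argmin_{y >= 0} (x - y)**2."""
    return max(x, 0.0)

for t in (1.0, 10.0, 1000.0):
    for x in (-2.0, 0.5, 2.0):
        print(f"t={t:7.1f} x={x:5.1f} "
              f"g_t={g_barrier(x, t):.6f} g={g_exact(x):.6f} "
              f"g_t'={g_barrier_grad(x, t):.6f}")
```

For large `t` the printed values approach $g(x) = \max(x, 0)$, with gradient near 1 for $x > 0$ and near 0 for $x < 0$, which matches the convergence described in the caption.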

Theorems & Definitions (13)

  • Lemma 3.1
  • Lemma 3.1
  • proof
  • Lemma 3.2
  • proof
  • Lemma 3.3
  • proof
  • Lemma 3.4
  • proof
  • Lemma 4.1
  • ...and 3 more