Dual Space Preconditioning for Gradient Descent in the Overparameterized Regime

Reza Ghane; Danil Akhtiamov; Babak Hassibi

TL;DR

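This work proves that, for over-parameterized linear models, the iterates of dual space preconditioned gradient descent always converge to an interpolating point ${W}_{\infty}$ satisfying ${X}{W}_{\infty} = {Y}$. The proof introduces a novel version of the Bregman Divergence, with accompanying identities, that is used to establish convergence.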

Abstract

In this work we study the convergence properties of Dual Space Preconditioned Gradient Descent, which encompasses optimizers such as Normalized Gradient Descent, Gradient Clipping and Adam. We consider preconditioners of the form $\nabla K$, where $K: \mathbb{R}^p \to \mathbb{R}$ is convex, and assume that the method is applied to train an over-parameterized linear model with loss of the form $\ell({X} {W} - {Y})$, for weights ${W} \in \mathbb{R}^{d \times k}$, labels ${Y} \in \mathbb{R}^{n \times k}$ and data ${X} \in \mathbb{R}^{n \times d}$. Under these assumptions, we prove that the iterates of the preconditioned gradient descent always converge to a point ${W}_{\infty} \in \mathbb{R}^{d \times k}$ satisfying ${X}{W}_{\infty} = {Y}$. Our proof techniques are of independent interest, as we introduce a novel version of the Bregman Divergence with accompanying identities that allow us to establish convergence. We also study the implicit bias of Dual Space Preconditioned Gradient Descent. First, we demonstrate empirically that, for general $K(\cdot)$, ${W}_\infty$ depends on the chosen learning rate, hindering a precise characterization of the implicit bias. Then, for preconditioners of the form $K({G}) = h(\|{G}\|_F)$, known as isotropic preconditioners, we show that ${W}_\infty$ minimizes $\|{W}_\infty - {W}_0\|_F^2$ subject to ${X}{W}_\infty = {Y}$, where ${W}_0$ is the initialization. Denoting the convergence point of GD initialized at ${W}_0$ by ${W}_{\text{GD}, \infty}$, we thus note that ${W}_{\infty} = {W}_{\text{GD}, \infty}$ for isotropic preconditioners. Finally, we show that a similar fact holds for general preconditioners up to a multiplicative constant, namely, $\|{W}_0 - {W}_{\infty}\|_F \le c \|{W}_0 - {W}_{\text{GD}, \infty}\|_F$ for a constant $c>0$.
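The isotropic claim in the abstract can be checked numerically on a small problem. The sketch below is illustrative only and is not taken from the paper: it assumes the update $W_{t+1} = W_t - \eta \nabla K(\nabla \mathcal{L}(W_t))$ as the form of dual space preconditioning, a squared loss $\ell(Z) = \tfrac{1}{2}\|Z\|_F^2$, the isotropic choice $h(r) = \sqrt{1 + r^2} - 1$ (a smooth clipping-like preconditioner), and arbitrary problem sizes and step size. The function names `grad_K` and `preconditioned_gd` are hypothetical.

```python
import numpy as np

def grad_K(G):
    # Assumed isotropic preconditioner K(G) = h(||G||_F) with
    # h(r) = sqrt(1 + r^2) - 1, so grad K(G) = G / sqrt(1 + ||G||_F^2).
    return G / np.sqrt(1.0 + np.sum(G * G))

def preconditioned_gd(X, Y, W0, lr, steps):
    # Assumed update: W <- W - lr * grad_K(grad L(W)),
    # with L(W) = 0.5 * ||X W - Y||_F^2 (squared loss, an assumption).
    W = W0.copy()
    for _ in range(steps):
        G = X.T @ (X @ W - Y)  # gradient of the squared loss at W
        W = W - lr * grad_K(G)
    return W

rng = np.random.default_rng(0)
n, d, k = 5, 20, 3  # over-parameterized: more features than samples
X = rng.standard_normal((n, d))
Y = rng.standard_normal((n, k))
W0 = rng.standard_normal((d, k))

# Step size 1 / lambda_max(X^T X); a conservative, assumed choice.
lr = 1.0 / np.linalg.norm(X, 2) ** 2
W_inf = preconditioned_gd(X, Y, W0, lr, steps=20000)

# Closest interpolating point to W0 in Frobenius norm:
# the projection of W0 onto {W : X W = Y}.
W_proj = W0 - np.linalg.pinv(X) @ (X @ W0 - Y)

interp_err = np.linalg.norm(X @ W_inf - Y)  # distance from interpolation
proj_gap = np.linalg.norm(W_inf - W_proj)   # distance from the projection
print(interp_err, proj_gap)
```

Under these assumptions both printed quantities should be close to zero. The reason is that an isotropic $\nabla K$ only rescales the gradient ${X}^\top({X}{W} - {Y})$, so every update stays in the row space of ${X}$, and the only interpolating point reachable from ${W}_0$ in that way is the projection of ${W}_0$ onto $\{{W} : {X}{W} = {Y}\}$.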

Paper Structure (13 sections, 11 theorems, 92 equations, 2 figures)

Introduction
Main Results and Applications
Examples and Discussion
Outline of the proofs
Convergence
Proximity to GD
Experiments
Conclusion
Details of the Experiments
Proof of Proposition 1 & Theorem 1
Proof of Theorem 2 Part 1
Proof of Theorem 2 Part 2
Auxiliary Lemmata

Key Result

Proposition 1

For $K, \mathcal{L}$ satisfying Assumptions 1.1 & 1.2, and for any $\mathbf{W} \in \mathbb{R}^d$ from Definition def: breg and $\{\mathbf{W}_i\}_{i=1}$ generated according to alg: precond:

Figures (2)

Figure 1: Distance from $p = 1, 2, \infty$ solutions in \ref{exp: lp} for the solution obtained using \ref{exp: precond}.
Figure 2: The distance of solutions obtained using \ref{exp: precond} from $\mathbf{W}_{\text{GD},\infty}$ and $\mathbf{W}_{\text{ref}}$, normalized by the norm of $\mathbf{W}_{\text{ref}}$.

Theorems & Definitions (26)

Definition 1: Strong Convexity
Definition 2
Definition 3
Definition 4: Bregman Divergence
Definition 5: Adjusted Bregman Divergence
Proposition 1
Theorem 1
Theorem 2
Remark 1
Lemma 1
...and 16 more
