Hyperparameter tuning via trajectory predictions: Stochastic prox-linear methods in matrix sensing

Mengqi Lou; Kabir Aladin Verchand; Ashwin Pananjady

TL;DR

This paper analyzes a mini-batched prox-linear iterative algorithm for the canonical problem of recovering an unknown rank-1 matrix from noisy rank-1 Gaussian measurements. The analysis shows that the method, though stochastic, converges linearly from a local initialization with a fixed step-size to a statistical error floor.

Abstract

Motivated by the desire to understand stochastic algorithms for nonconvex optimization that are robust to their hyperparameter choices, we analyze a mini-batched prox-linear iterative algorithm for the problem of recovering an unknown rank-1 matrix from rank-1 Gaussian measurements corrupted by noise. We derive a deterministic recursion that predicts the error of this method and show, using a non-asymptotic framework, that this prediction is accurate for any batch-size and a large range of step-sizes. In particular, our analysis reveals that this method, though stochastic, converges linearly from a local initialization with a fixed step-size to a statistical error floor. Our analysis also exposes how the batch-size, step-size, and noise level affect the (linear) convergence rate and the eventual statistical estimation error, and we demonstrate how to use our deterministic predictions to perform hyperparameter tuning (e.g. step-size and batch-size selection) without ever running the method. On a technical level, our analysis is enabled in part by showing that the fluctuations of the empirical iterates around our deterministic predictions scale with the error of the previous iterate.
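
To make the algorithm described above concrete, here is a minimal NumPy sketch of one mini-batched prox-linear step. It is an illustration under stated assumptions, not the paper's exact formulation. It assumes measurements of the form $y_i = \langle x_i, \boldsymbol{\mu}_\sharp \rangle \langle z_i, \boldsymbol{\nu}_\sharp \rangle + \sigma \epsilon_i$ with Gaussian $x_i, z_i$, a squared loss linearized around the current iterate, and a proximal penalty $\lambda \|\cdot\|_2^2$ whose normalization may differ from the paper's. The function names `prox_linear_step`, `matrix_error`, and `run_demo` are hypothetical, and the choice $\lambda = (1+\sigma^{2})d/m$ is borrowed from the Figure 2 caption.

```python
import numpy as np


def prox_linear_step(mu, nu, X, Z, y, lam):
    """One mini-batch prox-linear step (assumed form, see the lead-in).

    Linearizes (mu, nu) -> <x_i, mu><z_i, nu> around the current iterate and
    solves the resulting ridge-regularized least-squares problem exactly.
    """
    m, d = X.shape
    a = X @ mu                      # <x_i, mu_t> for each sample in the batch
    b = Z @ nu                      # <z_i, nu_t> for each sample in the batch
    r = y - a * b                   # residual of the bilinear model at (mu_t, nu_t)
    # Jacobian of the bilinear map at the current iterate, one row per sample.
    J = np.hstack([b[:, None] * X, a[:, None] * Z])    # shape (m, 2d)
    # Minimize (1/m) * ||r - J @ delta||^2 + lam * ||delta||^2 over delta.
    # The m x m dual form is cheaper than the 2d x 2d form when m << d.
    delta = J.T @ np.linalg.solve(J @ J.T + m * lam * np.eye(m), r)
    return mu + delta[:d], nu + delta[d:]


def matrix_error(mu, nu, mu_star, nu_star):
    """Frobenius distance between the rank-1 estimate and the rank-1 target."""
    return np.linalg.norm(np.outer(mu, nu) - np.outer(mu_star, nu_star))


def run_demo(d=50, m=16, T=300, sigma=0.0, seed=0):
    """Run T steps on fresh mini-batches from a local initialization."""
    rng = np.random.default_rng(seed)
    mu_star = rng.standard_normal(d)
    mu_star /= np.linalg.norm(mu_star)
    nu_star = rng.standard_normal(d)
    nu_star /= np.linalg.norm(nu_star)

    # Local initialization: the target plus a small perturbation (assumed size 0.1).
    g_mu = rng.standard_normal(d)
    g_nu = rng.standard_normal(d)
    mu = mu_star + 0.1 * g_mu / np.linalg.norm(g_mu)
    nu = nu_star + 0.1 * g_nu / np.linalg.norm(g_nu)

    lam = (1.0 + sigma**2) * d / m  # inverse step-size, as in the Figure 2 caption
    errs = [matrix_error(mu, nu, mu_star, nu_star)]
    for _ in range(T):
        X = rng.standard_normal((m, d))
        Z = rng.standard_normal((m, d))
        y = (X @ mu_star) * (Z @ nu_star) + sigma * rng.standard_normal(m)
        mu, nu = prox_linear_step(mu, nu, X, Z, y, lam)
        errs.append(matrix_error(mu, nu, mu_star, nu_star))
    return errs
```

Calling `run_demo(sigma=0.0)` returns the per-iteration error, which can be plotted on a log scale to inspect the convergence behavior. Setting `sigma` above zero should make the curve level off at a nonzero value instead of continuing to decrease. Error is measured on the matrix $\boldsymbol{\mu}\boldsymbol{\nu}^{\top}$ rather than on the factors because rescaling $(c\boldsymbol{\mu}, \boldsymbol{\nu}/c)$ leaves the measurements unchanged.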

Paper Structure (94 sections, 21 theorems, 435 equations, 3 figures)

Introduction
Motivation and main contributions
Motivation #1: Fine-grained convergence phenomena.
Motivation #2: Efficient hyperparameter tuning.
Main Contributions.
Related work
Stochastic aProx methods
Learning dynamics and trajectory analyses
Notation and organization
Main Results
Deterministic one-step predictions
Convergence result
Trajectory predictions and application to hyperparameter tuning
Obtaining a trajectory prediction
Hyperparameter tuning
...and 79 more sections

Key Result

Theorem 1

Suppose data are drawn from the model \eqref{eq:model} and let Assumptions \ref{assumption-unit-norm} and \ref{assumption} hold. Let $\boldsymbol{\mu}_{\sharp}, \boldsymbol{\nu}_{\sharp} \in \mathbb{R}^d$ satisfy $K_1 \leq \| \boldsymbol{\mu}_\sharp \|_2, \| \boldsymbol{\nu}_\sharp \|_2 \leq K_2$ for a pair of universal constants $K_1, K_2$ [...] the following hold with probability at least $1 - d^{-20}$.

Figures (3)

Figure 1: Panel (a) demonstrates the convergence behavior for batch-sizes $m = 8,16,32$. Panel (b) demonstrates the convergence behavior for initial inverse step-sizes $\lambda_{0} = 1,10,100$. Each experiment consists of $30$ independent trials and shaded envelopes denote the interquartile range over the $30$ independent trials. Solid lines denote the median of $\mathsf{Err}_{t}$ over the independent trials and dashed lines (barely visible) denote the deterministic predicted error $\mathsf{Err}_{t}^{\mathsf{seq}}$ (see Section \ref{sec:experimental-results} for its definition).
Figure 2: Low noise (panel (a)) and high noise (panel (b)) behavior of the prox-linear method for batch-sizes $m=4,8,16,32$ and inverse step-size choice $\lambda = (1+\sigma^{2})d/m$. Each experiment starts from an initialization satisfying $\alpha_{0} = \widetilde{\alpha}_{0} = 0.99$ and $\|\boldsymbol{\mu}_{0}\|_{2} = \|\boldsymbol{\nu}_{0}\|_{2} = 1$ and runs to convergence. In panel (a), each experiment consists of $10$ independent trials and the shaded envelopes ($m=4$) denote the range over the $10$ trials. In panel (b), each experiment consists of $30$ independent trials and the shaded envelopes denote the interquartile range over the $30$ trials. Solid lines denote the median of the empirical error $\mathsf{Err}_{t}$ (\ref{def:empirical-error-iterations}) over the independent trials and dashed lines (barely visible) denote the predicted error $\mathsf{Err}_{t}^{\mathsf{seq}}$ (\ref{eq:error-prediction}).
Figure 3: Low noise (panel (a)) and high noise (panel (b)) behavior of the prox-linear method for batch-size $m=32$ and inverse step-size selections $\lambda = 1, 10, 100, 200$. Each experiment starts from an initialization satisfying $\alpha_{0} = \widetilde{\alpha}_{0} = 0.99$ and $\|\boldsymbol{\mu}_{0}\|_{2} = \|\boldsymbol{\nu}_{0}\|_{2} = 1$ and runs to convergence. In panel (a), each experiment consists of $10$ independent trials and the shaded envelopes ($\lambda = 200$) denote the range over the $10$ trials. In panel (b), each experiment consists of $30$ independent trials and the shaded envelopes denote the interquartile range over the $30$ trials. Solid lines denote the median of the empirical error $\mathsf{Err}_{t}$ (\ref{def:empirical-error-iterations}) over the independent trials and dashed lines (barely visible) denote the predicted error $\mathsf{Err}_{t}^{\mathsf{seq}}$ (\ref{eq:error-prediction}).

Theorems & Definitions (24)

Theorem 1
Theorem 2
Lemma 1
Lemma 2
Lemma 3
Lemma 4
Lemma 5: Stochastic error of the fully orthogonal component
Lemma 6: Bias of the fully orthogonal component
Lemma 7
Lemma 8
...and 14 more
