Unbiased least squares regression via averaged stochastic gradient descent

Nabil Kahalé

TL;DR

This work tackles online least-squares regression by making the time-average SGD estimator unbiased via randomized multilevel Monte Carlo. It constructs unbiased estimators for the bias-corrected mean and for the minimizer, achieving an expected time of order $k$ per target level and an $O(1/k)$ excess risk, with poly-logarithmic dependence on the smallest eigenvalue of the Hessian. It also develops unbiased estimators for squared bias and variances, enabling unbiased risk assessments for multiple copies and both standard and average-start variants without knowledge of $H$ or $\theta^*$. Empirical results on synthetic Gaussian setups corroborate the theory, illustrating efficiency and parallelizability of the proposed estimators. The approach offers a principled way to quantify and reduce bias and variance in online least-squares with scalable unbiased estimation techniques.
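To make the starting point concrete, here is a minimal sketch of a time-averaged SGD estimator for least squares. It is only an illustration under stated assumptions, not the paper's exact construction: the function name `averaged_sgd`, the `start` argument (the first iterate included in the average), the constant step size, and the three-dimensional toy problem are all hypothetical choices.

```python
import numpy as np

def averaged_sgd(sample, theta0, step, k, start=0):
    """Run k SGD steps on the least-squares loss and average iterates start..k-1."""
    theta = np.asarray(theta0, dtype=float).copy()
    running_sum = np.zeros_like(theta)
    for t in range(k):
        if t >= start:
            running_sum += theta              # accumulate theta_t before updating it
        x, y = sample()                       # fresh independent copy of (x, y)
        theta -= step * (x @ theta - y) * x   # stochastic gradient step
    return running_sum / (k - start)

# Hypothetical toy problem (not from the paper): Gaussian x with diagonal
# covariance, and y = <x, theta_star> + Gaussian noise.
rng = np.random.default_rng(0)
h = np.array([1.0, 0.5, 0.25])               # assumed eigenvalues of H = E[x x^T]
theta_star = np.array([1.0, -2.0, 0.5])      # assumed optimal solution
noise_std = 0.1

def sample():
    x = rng.standard_normal(3) * np.sqrt(h)
    y = x @ theta_star + noise_std * rng.standard_normal()
    return x, y

theta_bar = averaged_sgd(sample, np.zeros(3), step=0.1, k=20000, start=10000)
print(np.linalg.norm(theta_bar - theta_star))
```

For any finite number of steps, the average returned by such a routine is in general a biased estimator of $\theta^*$ when started away from it; that bias is what the randomized multilevel construction is meant to remove.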

Abstract

We consider an online least squares regression problem with optimal solution $\theta^*$ and Hessian matrix $H$, and study a time-average stochastic gradient descent estimator of $\theta^*$. For $k\ge2$, we provide an unbiased estimator of $\theta^*$ that is a modification of the time-average estimator, runs with an expected number of time-steps of order $k$, and has $O(1/k)$ expected excess risk. The constant behind the $O$ notation depends on parameters of the regression and is a poly-logarithmic function of the smallest eigenvalue of $H$. We provide both a biased and an unbiased estimator of the expected excess risk of the time-average estimator and of its unbiased counterpart, without requiring knowledge of either $H$ or $\theta^*$. We describe an "average-start" version of our estimators with similar properties. Our approach is based on randomized multilevel Monte Carlo. Our numerical experiments confirm our theoretical findings.
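The randomized multilevel Monte Carlo idea can be illustrated with the generic single-term estimator: to estimate a sum of level increments $\sum_{l\ge0} E(\Delta_l)$, draw a random level $N$ with $P(N=l)=p_l>0$ and return $\Delta_N/p_N$, which has the right expectation whenever the sum converges absolutely. The sketch below is a toy instance and not the paper's estimator: the names `single_term_estimator` and `toy_delta`, the geometric level distribution, and the deterministic increments $2^{-(l+1)}$ (which sum to 1) are all assumptions made for illustration.

```python
import numpy as np

def single_term_estimator(delta, r, rng):
    """Return one unbiased sample of sum_l E[delta(l)] using a geometric level N."""
    n = rng.geometric(1.0 - r) - 1     # P(N = l) = (1 - r) * r**l for l >= 0
    p_n = (1.0 - r) * r**n
    return delta(n, rng) / p_n         # importance-weight the sampled increment

def toy_delta(l, rng):
    # Hypothetical deterministic increment; the increments sum to 1 over l >= 0.
    return 2.0 ** (-(l + 1))

rng = np.random.default_rng(1)
r = 2.0 ** -1.5                        # level tail chosen so the variance is finite
samples = [single_term_estimator(toy_delta, r, rng) for _ in range(200_000)]
mlmc_mean = float(np.mean(samples))
print(mlmc_mean)                       # should be close to 1
```

The choice of `r` trades expected cost against variance: the levels must be sampled often enough that $\sum_l E(\Delta_l^2)/p_l$ stays finite, while rare enough that the expected work per sample remains bounded.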

Paper Structure (32 sections, 28 theorems, 147 equations, 4 figures, 4 tables)

Introduction
Contributions
Other related work
Notation and Background
Bias and Variance
Main Assumption
Time-average estimators
Example
Unbiased time-average estimators
The single term estimator
Unbiased estimator construction
Construction of $(f_{k,l},l\geq0)$
Construction of $Z_{k}$ and of $\hat{f}_{k}$
Unbiased estimator with average start
Squared Bias and Variance Estimation
...and 17 more sections

Key Result

Lemma 2.1

Assume that $(x,y)$ is square-integrable, that $((x_{t},y_{t}), t\geq0)$ is a sequence of independent copies of $(x,y)$, that Assumption 1 holds, and that [...]. Then [...] and, for $k\geq1$, we have [...] and [...]. Moreover, if $H$ is positive-definite, then $E(\theta_{k})\rightarrow \theta^{*}$ as $k$ goes to infinity.

Figures (4)

Figure 1: Product of average running time and variance with $H=\operatorname{diag}(i^{-3})$, $1\le i\le 25$, $\sigma^{2}=1$, $s(k)=k/2$ and $n=10^{8}/k$.
Figure 2: Expected excess risk with $H=\operatorname{diag}(i^{-3})$, $1\le i\le 25$, $\sigma^{2}=1$, $s(k)=k/2$ and $n=10^{8}/k$.
Figure 3: Product of average running time and variance with $H=\operatorname{diag}(i^{-1})$, $1\le i\le 50$, $\sigma^{2}=0.01$, $s(k)=k/2$ and $n=10^{8}/k$.
Figure 4: Expected excess risk with $H=\operatorname{diag}(i^{-1})$, $1\le i\le 50$, $\sigma^{2}=0.01$, $s(k)=k/2$ and $n=10^{8}/k$.
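The setup named in the captions of Figures 1 and 2 can be sketched as a synthetic Gaussian data generator. This is a hedged reconstruction, not the paper's experiment code. It assumes that $H=E(xx^{T})$, that $\sigma^{2}$ is the variance of additive Gaussian noise on $y$, and that the excess risk is $\frac{1}{2}(\theta-\theta^{*})^{T}H(\theta-\theta^{*})$. The value of `theta_star` and the helper names `sample_batch` and `excess_risk` are hypothetical.

```python
import numpy as np

d = 25
h = np.arange(1, d + 1, dtype=float) ** -3.0   # diagonal of H = diag(i^{-3}), 1 <= i <= 25
sigma2 = 1.0                                   # assumed to be the noise variance
rng = np.random.default_rng(2)
theta_star = np.ones(d)                        # hypothetical; not given in the captions

def sample_batch(n):
    """Draw n independent pairs (x, y) with x ~ N(0, H) and y = <x, theta_star> + noise."""
    x = rng.standard_normal((n, d)) * np.sqrt(h)
    y = x @ theta_star + np.sqrt(sigma2) * rng.standard_normal(n)
    return x, y

def excess_risk(theta):
    """Excess risk under the assumed convention 0.5 * (theta - theta*)^T H (theta - theta*)."""
    diff = theta - theta_star
    return 0.5 * float(diff @ (h * diff))

x, y = sample_batch(200_000)
empirical_h = (x ** 2).mean(axis=0)            # should approximate the diagonal of H
print(empirical_h[:3], h[:3])
```

With eigenvalues decaying as $i^{-3}$, the smallest eigenvalue of $H$ is $25^{-3}$, which makes this a natural test of the poly-logarithmic dependence on the smallest eigenvalue claimed in the abstract.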

Theorems & Definitions (44)

Lemma 2.1
Proposition 2.1
Theorem 3.1 (Glynn and Rhee, 2015)
Corollary 3.1
Lemma 3.1
Lemma 3.2
Lemma 3.3
Theorem 3.2
Proposition 3.1
Theorem 3.3
...and 34 more
