Interactive Learning of Single-Index Models via Stochastic Gradient Descent

Nived Rajaraman; Yanjun Han

TL;DR

It is shown that, similar to the optimal interactive learner, SGD undergoes a distinct ``burn-in'' phase before entering the ``learning'' phase in this setting, and with an appropriately chosen learning rate schedule, a single SGD procedure simultaneously achieves near-optimal sample complexity and regret guarantees across both phases.

Abstract

Stochastic gradient descent (SGD) is a cornerstone algorithm for high-dimensional optimization, renowned for its empirical successes. Recent theoretical advances have provided a deep understanding of how SGD enables feature learning in high-dimensional nonlinear models, most notably the \textit{single-index model} with i.i.d. data. In this work, we study the sequential learning problem for single-index models, also known as generalized linear bandits or ridge bandits, where SGD is a simple and natural solution, yet its learning dynamics remain largely unexplored. We show that, similar to the optimal interactive learner, SGD undergoes a distinct ``burn-in'' phase before entering the ``learning'' phase in this setting. Moreover, with an appropriately chosen learning rate schedule, a single SGD procedure simultaneously achieves near-optimal (or best-known) sample complexity and regret guarantees across both phases, for a broad class of link functions. Our results demonstrate that SGD remains highly competitive for learning single-index models under adaptive data.
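For orientation, the interactive single-index (ridge bandit) setting referred to above is usually written as follows. This is a sketch of the standard formulation, not a statement taken from the paper: the noise symbol $\xi_t$ and the unit-norm constraints are assumptions made here for illustration, while $f$, $\theta^\star$, $a_t$, $\theta_t$ and $m_t$ follow the notation used in the theorem and figure captions on this page.

$$
r_t = f\big(\langle \theta^\star, a_t \rangle\big) + \xi_t, \qquad \|a_t\|_2 = \|\theta^\star\|_2 = 1, \qquad m_t = \langle \theta_t, \theta^\star \rangle .
$$

Here the learner picks the action $a_t$ based on past observations, sees the noisy reward $r_t$, and progress is tracked through the correlation $m_t$ between the current iterate $\theta_t$ and the unknown direction $\theta^\star$.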

Paper Structure (40 sections, 15 theorems, 98 equations, 2 figures)

Introduction
Notation.
Main results
Related work
Single-index models.
Generalized linear bandits.
Gradient descent in online learning and bandits.
Organization
Analysis of the SGD update
Analysis of the learning phase
Pure exploration
Regime I: $t\le s< T_0$.
Regime II: $s\ge T_0$.
Regret minimization
Analysis of the burn-in phase
...and 25 more sections

Key Result

Theorem 1.1

Let $\varepsilon,\delta>0$. Under Assumption \ref{assump:f}, let $(a_t, \theta_t)_{t\ge 1}$ be given by the SGD evolution in \eqref{eq:exploration} and \eqref{eq:SGD-decision-making}, with an initialization $\theta_1$ such that $\langle \theta_1, \theta^\star \rangle \ge 1-\gamma_0/4$.

Figures (2)

Figure 1: Correlation $m_t = \langle \theta_t, \theta^\star \rangle$ plotted as a function of $t$ in $d=20$ dimensions, for the cubic link $f(x)=x^3$. We run interactive SGD with a constant learning rate $\eta_t = 0.002$ for all $t$, using an exploration schedule with $\sigma_t= 0.5$ until $m_t$ reaches $0.7$ and $\sigma_t=0.2$ thereafter.
Figure 2: An example behavior of SGD for pure exploration in the learning phase (cf. \ref{lemma:local-improvement-learning-PE}). For appropriately chosen learning rates, if the correlation reaches $m_t \ge 1-\varepsilon$ at time $t$, then with high probability the SGD dynamics satisfy the following: $(i)$ the trajectory never degrades too much, with $m_s \ge 1 - 2 \varepsilon$ for all $t \le s \le t+\Delta$; $(ii)$ at some time $s=T_0\in [t,t+\Delta]$, $m_s$ improves to at least $1-\frac{\varepsilon}{4}$; and $(iii)$ thereafter, $m_s$ may decrease, but never falls below $1 - \frac{\varepsilon}{2}$ for all $T_0 \le s \le t + \Delta$.
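
The experiment described in the Figure 1 caption can be sketched roughly as below. The values $d=20$, $f(x)=x^3$, $\eta_t=0.002$, and the switch of $\sigma_t$ from $0.5$ to $0.2$ once $m_t$ reaches $0.7$ come from the caption. Everything else is an assumption made for illustration, since the exploration and update equations are not reproduced on this page: the action $a_t = \sqrt{1-\sigma_t^2}\,\theta_t + \sigma_t z_t$ with $z_t$ a random unit vector orthogonal to $\theta_t$, the reward-weighted update $\theta_{t+1} \propto \theta_t + \eta_t r_t z_t$, the Gaussian reward noise, the sign-aligned random initialization, and the function name `run_interactive_sgd`.

```python
import numpy as np


def cubic_link(x):
    """Cubic link f(x) = x^3, as in the Figure 1 caption."""
    return x ** 3


def run_interactive_sgd(d=20, T=5000, eta=0.002, sigma_hi=0.5, sigma_lo=0.2,
                        switch_at=0.7, noise_std=1.0, seed=0):
    """Hypothetical sketch of interactive SGD on the unit sphere.

    The exploration rule and the update below are illustrative assumptions,
    not the paper's exact equations.
    """
    rng = np.random.default_rng(seed)

    # Unknown direction theta_star and a random unit-norm initial iterate.
    theta_star = rng.standard_normal(d)
    theta_star /= np.linalg.norm(theta_star)
    theta = rng.standard_normal(d)
    theta /= np.linalg.norm(theta)
    # Assumption: start with nonnegative correlation (convenient for an odd link).
    if theta @ theta_star < 0:
        theta = -theta

    m = np.empty(T)      # m[t] = <theta_t, theta_star>
    switched = False     # becomes True once m_t has reached switch_at
    for t in range(T):
        m[t] = theta @ theta_star
        if m[t] >= switch_at:
            switched = True
        sigma = sigma_lo if switched else sigma_hi

        # Random unit direction orthogonal to the current iterate.
        z = rng.standard_normal(d)
        z -= (z @ theta) * theta
        z /= np.linalg.norm(z)

        # Unit-norm action: exploit theta, explore along z.
        a = np.sqrt(1.0 - sigma ** 2) * theta + sigma * z

        # Noisy reward from the single-index model.
        r = cubic_link(a @ theta_star) + noise_std * rng.standard_normal()

        # Reward-weighted step along the exploration direction, then renormalize.
        theta = theta + eta * r * z
        theta /= np.linalg.norm(theta)

    return m, theta, theta_star
```

Calling `m, theta, theta_star = run_interactive_sgd()` returns the recorded correlations, which can be plotted against $t$ to compare qualitatively with Figure 1. The sketch is not tuned to reproduce the figure, and the horizon `T` may need to be much larger before the iterate leaves the burn-in phase.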

Theorems & Definitions (22)

Remark 1
Theorem 1.1: Learning Phase
Theorem 1.2: Burn-in Phase
Corollary 1: Overall sample complexity and regret
Lemma 1: Drift
Lemma 2: Martingale difference
Lemma 3: Sum of martingale differences
Lemma 4: Normalization error
Lemma 5: Local improvement for pure exploration
Lemma 6: Local improvement for regret minimization
...and 12 more
