Interactive Learning of Single-Index Models via Stochastic Gradient Descent

Nived Rajaraman, Yanjun Han

TL;DR

It is shown that, similar to the optimal interactive learner, SGD undergoes a distinct ``burn-in'' phase before entering the ``learning'' phase in this setting, and with an appropriately chosen learning rate schedule, a single SGD procedure simultaneously achieves near-optimal sample complexity and regret guarantees across both phases.

Abstract

Stochastic gradient descent (SGD) is a cornerstone algorithm for high-dimensional optimization, renowned for its empirical successes. Recent theoretical advances have provided a deep understanding of how SGD enables feature learning in high-dimensional nonlinear models, most notably the \textit{single-index model} with i.i.d. data. In this work, we study the sequential learning problem for single-index models, also known as generalized linear bandits or ridge bandits, where SGD is a simple and natural solution, yet its learning dynamics remain largely unexplored. We show that, similar to the optimal interactive learner, SGD undergoes a distinct ``burn-in'' phase before entering the ``learning'' phase in this setting. Moreover, with an appropriately chosen learning rate schedule, a single SGD procedure simultaneously achieves near-optimal (or best-known) sample complexity and regret guarantees across both phases, for a broad class of link functions. Our results demonstrate that SGD remains highly competitive for learning single-index models under adaptive data.

Paper Structure

This paper contains 40 sections, 15 theorems, 98 equations, 2 figures.

Key Result

Theorem 1.1

Let $\varepsilon,\delta>0$. Under Assumption \ref{assump:f}, let $(a_t, \theta_t)_{t\ge 1}$ be given by the SGD evolution in \eqref{eq:exploration} and \eqref{eq:SGD-decision-making}, with an initialization $\theta_1$ such that $\langle \theta_1, \theta^\star \rangle \ge 1-\gamma_0/4$.

Figures (2)

  • Figure 1: Correlation $m_t = \langle \theta_t, \theta^\star \rangle$ plotted as a function of $t$ in $d=20$ dimensions, for the cubic link $f(x)=x^3$. We run interactive SGD with a constant learning rate $\eta_t = 0.002$ for all $t$, using an exploration schedule with $\sigma_t= 0.5$ until $m_t$ reaches $0.7$ and $\sigma_t=0.2$ thereafter.
  • Figure 2: An example behavior of SGD for pure exploration in the learning phase (cf. \ref{lemma:local-improvement-learning-PE}). For appropriately chosen learning rates, if the correlation hits $m_t \ge 1-\varepsilon$ at time $t$, the SGD dynamics will enjoy the following behaviors with high probability: $(i)$ the trajectory will never degrade too significantly, satisfying $m_s \ge 1 - 2 \varepsilon$ for all $t \le s \le t+\Delta$; $(ii)$ at some time $s=T_0\in [t,t+\Delta]$, $m_s$ improves to at least $1-\frac{\varepsilon}{4}$; and $(iii)$ thereafter, $m_s$ may decrease, but will never fall below $1 - \frac{\varepsilon}{2}$ for all $T_0 \le s \le t + \Delta$.
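
The experiment described in the Figure 1 caption could be simulated roughly as follows. This is only a sketch under stated assumptions: the caption fixes $d=20$, the cubic link, $\eta_t = 0.002$, and the two-level schedule for $\sigma_t$, but the exploration rule, the update rule, the noise level `noise_std`, the random initialization, and the function name `run_interactive_sgd` are all hypothetical choices made here for illustration and are not taken from the paper.

```python
import numpy as np

def run_interactive_sgd(d=20, T=5000, eta=0.002, noise_std=0.1, seed=0):
    """Return the correlations m_t = <theta_t, theta_star> for t = 1..T."""
    rng = np.random.default_rng(seed)
    f = lambda x: x ** 3  # cubic link from the Figure 1 caption

    # Unknown unit-norm parameter and a random unit-norm initialization (assumed).
    theta_star = rng.standard_normal(d)
    theta_star /= np.linalg.norm(theta_star)
    theta = rng.standard_normal(d)
    theta /= np.linalg.norm(theta)

    m = np.empty(T)
    reached = False  # becomes True once m_t first reaches 0.7
    for t in range(T):
        m[t] = theta @ theta_star
        reached = reached or m[t] >= 0.7
        sigma = 0.2 if reached else 0.5  # schedule from the caption

        # Assumed exploration rule: perturb theta along a random unit
        # direction z orthogonal to it, so the action a has unit norm.
        z = rng.standard_normal(d)
        z -= (z @ theta) * theta
        z /= np.linalg.norm(z)
        a = np.sqrt(1.0 - sigma ** 2) * theta + sigma * z

        # Noisy reward observed for the chosen action.
        r = f(a @ theta_star) + noise_std * rng.standard_normal()

        # Assumed update: reward-weighted step along z, then renormalize.
        theta = theta + eta * r * z
        theta /= np.linalg.norm(theta)
    return m
```

Calling `run_interactive_sgd()` and plotting the returned array against $t$ gives a curve of the same kind as Figure 1. Because the update and noise model are assumptions, the trajectory need not match the figure. The switch of $\sigma_t$ uses the true $m_t$, as the caption does, which is only available in simulation.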

Theorems & Definitions (22)

  • Remark 1
  • Theorem 1.1: Learning Phase
  • Theorem 1.2: Burn-in Phase
  • Corollary 1: Overall sample complexity and regret
  • Lemma 1: Drift
  • Lemma 2: Martingale difference
  • Lemma 3: Sum of martingale differences
  • Lemma 4: Normalization error
  • Lemma 5: Local improvement for pure exploration
  • Lemma 6: Local improvement for regret minimization
  • ...and 12 more