Stochastic Gradient Descent for Nonparametric Additive Regression

Xin Chen, Jason M. Klusowski

TL;DR

This work introduces Functional SGD (F-SGD) for online training of nonparametric additive models. It is memory- and time-efficient because each step updates only the coefficients of a truncated basis expansion. The authors establish an oracle inequality that allows for model mis-specification and, in the well-specified setting, show minimax-optimal rates over Sobolev ellipsoid function classes. Polynomial convergence rates are also obtained when the covariate distribution lacks full support. The method scales favorably relative to kernel methods and competing online approaches, adapts to unknown smoothness via Lepski’s method, and may extend to general convex losses. Practically, F-SGD enables streaming-friendly, scalable estimation of high-dimensional additive models with theoretical guarantees and empirical efficiency.

Abstract

This paper introduces an iterative algorithm for training nonparametric additive models that enjoys favorable memory storage and computational requirements. The algorithm can be viewed as the functional counterpart of stochastic gradient descent, applied to the coefficients of a truncated basis expansion of the component functions. We show that the resulting estimator satisfies an oracle inequality that allows for model mis-specification. In the well-specified setting, by choosing the learning rate carefully across three distinct stages of training, we demonstrate that its risk is minimax optimal in terms of the dependence on both the dimensionality of the data and the size of the training sample. Unlike past work, we also provide polynomial convergence rates even when the covariates do not have full support on their domain.
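
The idea of running stochastic gradient descent on the coefficients of a truncated basis expansion can be illustrated with a minimal sketch. It is not the authors' exact procedure: the cosine basis, the fixed truncation level `J`, the constant step size `gamma`, and the function names `cosine_features`, `fsgd_fit`, and `fsgd_predict` are all assumptions made for illustration, and the three-stage learning-rate schedule described in the abstract is not reproduced.

```python
import numpy as np

def cosine_features(x, J):
    """Basis values sqrt(2) * cos(j * pi * x_k) for j = 1..J; returns shape (p, J).

    Assumes each coordinate of x lies in [0, 1] (illustrative basis choice).
    """
    j = np.arange(1, J + 1)
    return np.sqrt(2.0) * np.cos(np.pi * np.outer(x, j))

def fsgd_fit(X, y, J=8, gamma=None):
    """One pass of SGD on squared loss over the p * J basis coefficients.

    Only the (p, J) coefficient array is stored, so memory does not grow with n.
    """
    n, p = X.shape
    if gamma is None:
        # Hypothetical constant step: keeps gamma * ||features||^2 <= 1.
        gamma = 1.0 / (2 * p * J)
    beta = np.zeros((p, J))  # start from the zero function
    for i in range(n):
        phi = cosine_features(X[i], J)
        residual = y[i] - np.sum(beta * phi)  # y_i minus current prediction
        beta += gamma * residual * phi        # gradient step on the coefficients
    return beta

def fsgd_predict(beta, X):
    """Evaluate the fitted additive function at each row of X."""
    J = beta.shape[1]
    return np.array([np.sum(beta * cosine_features(x, J)) for x in X])
```

Each update costs on the order of `p * J` operations and touches only the current observation, which is the streaming-friendly property the abstract emphasizes.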

Paper Structure (22 sections, 6 theorems, 121 equations, 4 figures, 1 table, 1 algorithm)

Key Result

Theorem 4.1

Suppose Assumptions 1-4 hold. Furthermore, assume $n \geq C_0 p^{1+1/(2s)}$, where $C_0 > 1$ is a constant. Let $A_1$, $A_2$, and $B$ be constants such that $A_1 = (2s+1)A_2$, $A_2 \geq \frac{2}{C_1}$, and $B \leq \frac{1}{4C_2M^2A_2^2}$. Assume $p \geq \frac{1}{B^{2s}}$. Set $\gamma_i$ and ... Then the MSE of the F-SGD update, initialized with $\widehat{f}_0 = 0$, satisfies ... Here $g_1, g_2$ ...
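
As a short worked check of the sample-size condition, assume the target rate is of order $p\,n^{-2s/(2s+1)}$. This form is an assumption read off from the benchmark slopes quoted in the figure captions (slope $-2s/(2s+1)$ in $n$, slope $1$ in $p$), since the displayed bound of the theorem is not reproduced here. Under that assumption, the condition on $n$ is, up to the constant $C_0$, the regime in which the rate is at most a constant:

$$
p\, n^{-\frac{2s}{2s+1}} \le 1
\iff n^{\frac{2s}{2s+1}} \ge p
\iff n \ge p^{\frac{2s+1}{2s}} = p^{1 + \frac{1}{2s}} .
$$

For $s = 2$ this reads $n \geq p^{5/4}$; with $p = 80$, $p^{5/4} \approx 239$.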

Figures (4)

  • Figure 1: $\log_{10} \| \widehat{f}_n - f\|^2$ against $\log_{10} n$. The benchmark (black line) has slope $-2s/(2s+1)= -4/5$, which represents the minimax optimal rate. Each curve is calculated as the average of 20 repetitions. (a) $\bm{X}$ follows data generating process (a). (b) $\bm{X}$ follows data generating process (b).
  • Figure 2: (a) $\log_{10} \| \widehat{f}_n - f\|^2$ against $\log_{10} n$. The benchmark (black line) has slope $-2s/(2s+1)= -4/5$, which represents the minimax optimal rate in terms of $n$. Vertical dotted lines are used to indicate the second stage for $p = 80$, which spans from $n = 161$ to $n = 239$. (b) $\log_{10} \| \widehat{f}_n - f\|^2$ against $\log_{10} p$ when $n=10^5$. The benchmark (black line) has slope 1, which represents the minimax optimal rate in terms of $p$. Each curve is calculated as the average of 20 repetitions.
  • Figure 3: $\log_{10} \| \widehat{f}_n - f\|^2$ against $\log_{10} n$. The benchmark (black line) has slope $-2s/(2s+1)= -4/5$, which represents the minimax optimal rate. Each curve is calculated as the average of 100 repetitions. (a) $X$ is uniformly distributed over [0, 1]. (b) $X$ is uniformly distributed over [0.25, 0.75].
  • Figure 4: $\log_{10} \| \widehat{f}_n - f\|^2$ against $\log_{10} n$, where $\widehat{f}_n$ is the F-SGD with Lepski's method. The benchmark (orange line) has slope $-2s/(2s+1)= -4/5$, which represents the optimal rate. Each curve is calculated as the average of 30 repetitions. (a) $X$ is uniformly distributed over [0, 1]. (b) $X$ is uniformly distributed over [0.25, 0.75].
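
The comparison made in these captions, an empirical log-log slope against the benchmark $-2s/(2s+1)$, can be sketched as follows. The helper names `loglog_slope` and `minimax_slope` are hypothetical, and fitting a least-squares line to the log-transformed values is one plausible way to read a slope off such a curve, not necessarily how the figures were produced.

```python
import numpy as np

def loglog_slope(n_values, errors):
    """Least-squares slope of log10(error) against log10(n)."""
    slope, _intercept = np.polyfit(np.log10(n_values), np.log10(errors), 1)
    return slope

def minimax_slope(s):
    """Benchmark slope -2s / (2s + 1) for smoothness s."""
    return -2.0 * s / (2.0 * s + 1.0)

# Synthetic errors decaying exactly like n^(-4/5), for illustration only.
n_values = np.array([1e2, 1e3, 1e4, 1e5])
errors = 3.0 * n_values ** (-0.8)
empirical = loglog_slope(n_values, errors)
benchmark = minimax_slope(2)  # s = 2 gives -4/5, as in the captions
```

An empirical slope close to the benchmark on the log-log scale is what the captions describe as tracking the minimax optimal rate.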

Theorems & Definitions (13)

  • Theorem 4.1
  • Corollary 4.2
  • Remark 4.3
  • Remark 4.4
  • Theorem 4.5
  • Theorem 4.6
  • Lemma 7.1
  • Proof
  • Lemma 7.2
  • Proof of Lemma (lem:general_bound)
  • ...and 3 more