Stochastic Gradient Descent for Nonparametric Additive Regression

Xin Chen; Jason M. Klusowski

TL;DR

This work introduces Functional SGD (F-SGD) for online training of nonparametric additive models. The method updates the coefficients of a truncated basis expansion, which keeps memory and time costs low. The authors establish an oracle inequality that allows for model mis-specification and, in the well-specified setting, show minimax-optimal rates over Sobolev ellipsoids. They also obtain polynomial convergence rates when the covariate distribution lacks full support. F-SGD compares favorably in memory and computation with kernel methods and competing online approaches, adapts to unknown smoothness via Lepski's method, and may extend to general convex losses. In practice, it supports streaming estimation of high-dimensional additive models with theoretical guarantees.

Abstract

This paper introduces an iterative algorithm for training nonparametric additive models that enjoys favorable memory storage and computational requirements. The algorithm can be viewed as the functional counterpart of stochastic gradient descent, applied to the coefficients of a truncated basis expansion of the component functions. We show that the resulting estimator satisfies an oracle inequality that allows for model mis-specification. In the well-specified setting, by choosing the learning rate carefully across three distinct stages of training, we demonstrate that its risk is minimax optimal in terms of the dependence on both the dimensionality of the data and the size of the training sample. Unlike past work, we also provide polynomial convergence rates even when the covariates do not have full support on their domain.
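The idea of running SGD on the coefficients of a truncated basis expansion can be illustrated with a minimal sketch. It is not the authors' exact algorithm: it assumes a cosine basis on [0, 1], a fixed truncation level `J`, and a constant learning rate `gamma`, whereas the abstract describes a learning rate chosen across three stages. The names `cosine_basis`, `fsgd_sketch`, and `predict` are hypothetical.

```python
import numpy as np

def cosine_basis(x, J):
    """Evaluate psi_j(x_k) = sqrt(2) * cos(pi * j * x_k) for j = 1..J.

    x has shape (p,); the result has shape (p, J).
    (Assumed basis for illustration only.)
    """
    j = np.arange(1, J + 1)
    return np.sqrt(2.0) * np.cos(np.pi * np.outer(x, j))

def fsgd_sketch(X, y, J, gamma):
    """One pass of SGD over (X, y) on the basis coefficients.

    Only the intercept and a (p, J) coefficient array are stored,
    so memory does not grow with the number of samples.
    """
    n, p = X.shape
    alpha = 0.0                 # intercept
    beta = np.zeros((p, J))     # coefficients of each component function
    for i in range(n):
        Psi = cosine_basis(X[i], J)
        resid = y[i] - (alpha + np.sum(beta * Psi))
        # Gradient step on the squared-error loss for this single sample
        alpha += gamma * resid
        beta += gamma * resid * Psi
    return alpha, beta

def predict(alpha, beta, X):
    """Evaluate the fitted additive function at each row of X."""
    J = beta.shape[1]
    return np.array([alpha + np.sum(beta * cosine_basis(x, J)) for x in X])
```

Calling `fsgd_sketch(X, y, J=5, gamma=0.02)` on an `(n, p)` array of covariates in [0, 1] returns the intercept and a `(p, J)` coefficient array. Each sample is used once and then discarded, which is what makes the update suitable for streaming data.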

Paper Structure (22 sections, 6 theorems, 121 equations, 4 figures, 1 table, 1 algorithm)

Introduction
Nonparametric Additive Models
Stochastic Gradient Descent
Contributions
Preliminaries
Function Spaces
Assumptions
Functional Stochastic Gradient Descent
Main Results
Oracle Inequality
Comparisons with Prior Estimators
Comparisons with Online Smooth Backfitting
Comparisons with Sieve-SGD
Comparisons with Other Reproducing Kernel Methods
Numerical Experiments
...and 7 more sections

Key Result

Theorem 4.1

Suppose Assumptions 1-4 hold. Furthermore, assume $n \geq C_0 p^{1+1/(2s)}$, where $C_0 > 1$ is a constant. Let $A_1$, $A_2$, and $B$ be constants such that $A_1 = (2s+1)A_2$, $A_2 \geq \frac{2}{C_1}$, and $B \leq \frac{1}{4C_2M^2A_2^2}$. Assume $p \geq \frac{1}{B^{2s}}$. Set $\gamma_i$ and ... Then the MSE of the F-SGD update, initialized with $\widehat{f}_0 = 0$, satisfies ... Here $g_1, g_2$ ...

Figures (4)

Figure 1: $\log_{10} \| \widehat{f}_n - f\|^2$ against $\log_{10} n$. The benchmark (black line) has slope $-2s/(2s+1)= -4/5$, which represents the minimax optimal rate. Each curve is calculated as the average of 20 repetitions. (a) $\bm{X}$ follows data generating process (a). (b) $\bm{X}$ follows data generating process (b).
Figure 2: (a) $\log_{10} \| \widehat{f}_n - f\|^2$ against $\log_{10} n$. The benchmark (black line) has slope $-2s/(2s+1)= -4/5$, which represents the minimax optimal rate in terms of $n$. Vertical dotted lines are used to indicate the second stage for $p = 80$, which spans from $n = 161$ to $n = 239$. (b) $\log_{10} \| \widehat{f}_n - f\|^2$ against $\log_{10} p$ when $n=10^5$. The benchmark (black line) has slope 1, which represents the minimax optimal rate in terms of $p$. Each curve is calculated as the average of 20 repetitions.
Figure 3: $\log_{10} \| \widehat{f}_n - f\|^2$ against $\log_{10} n$. The benchmark (black line) has slope $-2s/(2s+1)= -4/5$, which represents the minimax optimal rate. Each curve is calculated as the average of 100 repetitions. (a) $X$ is uniformly distributed over [0, 1]. (b) $X$ is uniformly distributed over [0.25, 0.75].
Figure 4: $\log_{10} \| \widehat{f}_n - f\|^2$ against $\log_{10} n$, where $\widehat{f}_n$ is the F-SGD with Lepski's method. The benchmark (orange line) has slope $-2s/(2s+1)= -4/5$, which represents the optimal rate. Each curve is calculated as the average of 30 repetitions. (a) $X$ is uniformly distributed over [0, 1]. (b) $X$ is uniformly distributed over [0.25, 0.75].
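The benchmark slope $-2s/(2s+1) = -4/5$ in these captions corresponds to $s = 2$. One way to compare a simulated error curve against such a benchmark is a least-squares fit in log-log coordinates. The sketch below is illustrative only: the error values are synthetic, generated to follow $C n^{-2s/(2s+1)}$ exactly, and are not the paper's results. The helper name `empirical_slope` is hypothetical.

```python
import numpy as np

def empirical_slope(n_values, errors):
    """Least-squares slope of log10(error) against log10(n)."""
    slope, _intercept = np.polyfit(np.log10(n_values), np.log10(errors), 1)
    return slope

s = 2                                  # smoothness implied by the -4/5 benchmark
benchmark = -2 * s / (2 * s + 1)       # = -0.8

# Synthetic errors that decay exactly at the benchmark rate (assumption)
n_values = np.array([1e2, 1e3, 1e4, 1e5])
errors = 3.0 * n_values ** benchmark

slope = empirical_slope(n_values, errors)
```

With errors averaged over repetitions of a real simulation in place of the synthetic `errors`, a fitted slope close to `benchmark` would be consistent with the rate shown by the benchmark lines.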

Theorems & Definitions (13)

Theorem 4.1
Corollary 4.2
Remark 4.3
Remark 4.4
Theorem 4.5
Theorem 4.6
Lemma 7.1
Proof
Lemma 7.2
Proof (of the lemma labeled lem:general_bound)
...and 3 more
