The Optimality of Kernel Classifiers in Sobolev Space

Jianfa Lai; Zhifan Li; Dongming Huang; Qian Lin

TL;DR

This work analyzes binary classification in reproducing kernel Hilbert spaces by studying kernel classifiers built from spectral algorithms, with gradient flow as the main example. Under a source condition that the Bayes function lies in an interpolation space $[\mathcal{H}]^s$, and with eigenvalue decay rate $\beta$, it proves an upper bound on the classification excess risk of $O(n^{-s\beta/(2s\beta+2)})$ and establishes a matching minimax lower bound in Sobolev spaces, which shows that the classifier is minimax optimal. The results extend to the generalization error of neural networks through the neural tangent kernel and are complemented by a practical smoothness-estimation method that is applied to real datasets. The appendices give the detailed bounds, embeddings, and technical steps that support the main theorems, and discuss limitations and extensions to more complex function structures.

Abstract

Kernel methods are widely used in machine learning, especially for classification problems. However, the theoretical analysis of kernel classification is still limited. This paper investigates the statistical performance of kernel classifiers. Under mild assumptions on the conditional probability $\eta(x)=\mathbb{P}(Y=1\mid X=x)$, we derive an upper bound on the classification excess risk of a kernel classifier using recent advances in the theory of kernel regression. We also obtain a minimax lower bound for Sobolev spaces, which shows the optimality of the proposed classifier. Our theoretical results can be extended to the generalization error of overparameterized neural network classifiers. To make our theoretical results more applicable in realistic settings, we also propose a simple method to estimate the interpolation smoothness of $2\eta(x)-1$ and apply the method to real datasets.
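To make the kind of classifier discussed above concrete, here is a minimal sketch of a gradient-flow kernel classifier written as a spectral algorithm. It is an illustration under stated assumptions, not the authors' implementation: the Gaussian kernel, the function names, and the parameters `t` and `gamma` are all hypothetical choices made for this example (the paper's analysis concerns Sobolev RKHSs). The sketch uses the standard closed form of gradient flow on the squared loss, whose filter function is $\varphi_t(z) = (1 - e^{-tz})/z$, and classifies by the sign of the fitted function.

```python
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    # Gaussian kernel, used only as a stand-in (ASSUMPTION: not the paper's kernel).
    sq = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-gamma * np.maximum(sq, 0.0))

def gradient_flow_filter(lam, t):
    # Filter phi_t(z) = (1 - exp(-t z)) / z, with its limit value t at z = 0.
    lam = np.asarray(lam, dtype=float)
    out = np.full_like(lam, float(t))
    nz = lam > 1e-12
    out[nz] = -np.expm1(-t * lam[nz]) / lam[nz]
    return out

def fit_gradient_flow_classifier(X, y, t, gamma=1.0):
    # y holds labels in {-1, +1}; t is the gradient-flow stopping time.
    n = X.shape[0]
    K = rbf_kernel(X, X, gamma) / n            # normalized kernel matrix
    lam, U = np.linalg.eigh(K)                 # spectral decomposition
    lam = np.clip(lam, 0.0, None)              # remove tiny negative round-off
    # Coefficients alpha = (1/n) * phi_t(K/n) y, applied in the eigenbasis.
    alpha = U @ (gradient_flow_filter(lam, t) * (U.T @ y)) / n

    def predict(X_new):
        scores = rbf_kernel(X_new, X, gamma) @ alpha
        return np.where(scores >= 0, 1, -1)    # plug-in classifier sign(f_hat)

    return predict
```

A possible usage pattern is `predict = fit_gradient_flow_classifier(X_train, y_train, t=100.0, gamma=5.0)` followed by `predict(X_test)`. The stopping time `t` plays the role of the regularization parameter: small values give smoother fits, large values approach interpolation of the training labels.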

Paper Structure (35 sections, 19 theorems, 89 equations, 2 figures, 1 table)

Introduction
Our contribution
Related works
Preliminaries
Interpolation Space of RKHS
Fractional Sobolev Space and Sobolev RKHS
Kernel Classifiers: Spectral Algorithm
Notations.
Main Results
Assumptions
Minimax optimality of kernel classifiers
Applications in Neural Networks
Estimation of smoothness
Determination of $s$.
Estimation of $s$ in regression.
...and 20 more sections

Key Result

Theorem 3.1

Suppose $f^*_\rho\in [H^r(\mathcal{X})]^s$ for $s>0$, where $H^r$ is the Sobolev RKHS. For every learning method $\hat{f}$ and any fixed $\delta \in (0,1)$, when $n$ is sufficiently large, there is a distribution $\rho \in \mathcal{P}$ such that, with probability at least $1 - \delta$, the classification excess risk of $\hat{f}$ is bounded below at the rate $n^{-s\beta/(2s\beta+2)}$, matching the upper bound up to a constant factor that involves a universal constant $C$.
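As a worked instance of the rate $n^{-s\beta/(2s\beta+2)}$ quoted in the TL;DR, assume that the Sobolev RKHS $H^r$ on a bounded domain in $\mathbb{R}^d$ with $r > d/2$ has eigenvalue decay rate $\beta = 2r/d$. This is a standard fact about Sobolev RKHSs, but it is not stated in this summary, so treat it as an assumption here. Under it, the exponent simplifies as follows:

```latex
\frac{s\beta}{2s\beta+2}
  = \frac{2sr/d}{4sr/d+2}
  = \frac{sr}{2sr+d},
\qquad\text{so}\qquad
n^{-\frac{s\beta}{2s\beta+2}} = n^{-\frac{sr}{2sr+d}} .
```

For example, with $s = 1$ the exponent is $r/(2r+d)$, which shows how the rate degrades as the dimension $d$ grows relative to the Sobolev order $r$.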

Figures (2)

Figure 1: Experiments for estimating the smoothness parameter $s$ in regression settings. (a) Naive estimation based on 2,000 sample points for $\sigma=0$ (blue) and $\sigma=0.1$ (orange). (b) Truncation Estimation based on 2,000 sample points with truncation point 100. In both plots (a) and (b), the $x$-axis is the logarithmic index $j$ and the $y$-axis is the logarithmic $p_j$. (c) Truncation Estimation across various values of sample size $n$, each repeated 50 times. The blue line represents the average of estimates, the shaded area represents one standard deviation, and the true value is indicated by the orange dashed line.
Figure 2: Experiments for estimating the smoothness parameter $s$ in classification settings. (a) The experiment uses 5,000 sample points and the truncation point is 100. (b) Truncation Estimation across various values of sample size $n$, each repeated 50 times. The blue line represents the average of estimates, the shaded area represents one standard deviation, and the true value is indicated by the orange dashed line.
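The captions above describe plots of $\log p_j$ against $\log j$ with a truncation point. A minimal sketch of how a decay exponent could be read off such a plot is given below. It is not the paper's estimator: the function names are hypothetical, the input `p` is simply taken as given, and the conversion from slope to $s$ assumes $p_j \asymp j^{-(s\beta+1)}$, which is an assumption made here for illustration (it follows from the summability condition $\sum_j p_j\, j^{s\beta} < \infty$ when $p_j$ are squared eigen-coefficients and $\lambda_j \asymp j^{-\beta}$).

```python
import numpy as np

def loglog_slope(p, truncation):
    # Least-squares slope of log p_j versus log j over j = 1..truncation.
    p = np.asarray(p, dtype=float)[:truncation]
    j = np.arange(1, p.size + 1)
    keep = p > 0                                # log is undefined for non-positive p_j
    slope, _intercept = np.polyfit(np.log(j[keep]), np.log(p[keep]), deg=1)
    return slope

def smoothness_from_slope(slope, beta):
    # ASSUMPTION: p_j ~ j^{-(s*beta + 1)}, hence s = (-slope - 1) / beta.
    return (-slope - 1.0) / beta
```

Truncating at a moderate index, such as the truncation point 100 mentioned in the captions, discards the high-index coefficients, which in practice tend to be dominated by noise.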

Theorems & Definitions (34)

Definition 1: Filter function
Definition 2: Spectral algorithm
Example 1: Classifier with gradient flow
Theorem 3.1: Lower Bound
Theorem 3.2: Upper Bound
Proposition 4.1: Theorem 1 in li2023statistical
Corollary 4.2
Definition 3: Filter function
Definition 4: Spectral algorithm
Lemma 7.1
...and 24 more
