Optimal Convergence Rates of Deep Neural Network Classifiers

Zihan Zhang, Lei Shi, Ding-Xuan Zhou

TL;DR

The paper addresses binary classification with a compositional conditional class probability under the Tsybakov noise condition, establishing a dimension-free minimax rate for the excess 0-1 risk. It introduces a novel oracle inequality for empirical risk minimizers (ERMs) with Lipschitz losses, using a truncation and a surrogate function to obtain sharp generalization bounds. The main results show that ReLU DNNs trained with hinge loss attain the optimal rate up to logarithmic factors, with explicit rates that depend on the compositional structure through $d_*$, $q$, $\beta$, and $s$, but are independent of the ambient dimension $d$; minimax lower bounds confirm that the rates cannot be improved. Practically, these findings help explain the effectiveness of hinge-loss-trained ReLU DNNs in high-dimensional settings and provide a broadly applicable framework for generalization analysis in structured classification tasks.

Abstract

In this paper, we study the binary classification problem on $[0,1]^d$ under the Tsybakov noise condition (with exponent $s \in [0,\infty]$) and the compositional assumption. This assumption requires the conditional class probability function of the data distribution to be the composition of $q+1$ vector-valued multivariate functions, where each component function is either a maximum value function or a Hölder-$\beta$ smooth function that depends only on $d_*$ of its input variables. Notably, $d_*$ can be significantly smaller than the input dimension $d$. We prove that, under these conditions, the optimal convergence rate for the excess 0-1 risk of classifiers is $\left( \frac{1}{n} \right)^{\frac{\beta\cdot(1\wedge\beta)^q}{\frac{d_*}{s+1}+\left(1+\frac{1}{s+1}\right)\cdot\beta\cdot(1\wedge\beta)^q}}$, which is independent of the input dimension $d$. Additionally, we demonstrate that ReLU deep neural networks (DNNs) trained with hinge loss can achieve this optimal convergence rate up to a logarithmic factor. This result provides theoretical justification for the excellent performance of ReLU DNNs in practical classification tasks, particularly in high-dimensional settings. The generalized approach is of independent interest.
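To make the rate in the abstract concrete, the exponent $\theta$ in $(1/n)^{\theta}$ can be evaluated numerically. The following is a minimal sketch, not code from the paper: the function name `rate_exponent` and the sample parameter values are hypothetical, chosen only for illustration. The ambient dimension $d$ is not an argument at all, which mirrors the dimension-free claim.

```python
import math


def rate_exponent(beta, q, d_star, s):
    """Exponent theta such that the stated optimal rate is (1/n)**theta.

    Transcribes the formula from the abstract:
        theta = b / (d_star/(s+1) + (1 + 1/(s+1)) * b),
    where b = beta * min(1, beta)**q.
    The function name and signature are illustrative assumptions.
    """
    b = beta * min(1.0, beta) ** q
    if math.isinf(s):
        # s = infinity: d_star/(s+1) and 1/(s+1) both vanish, so theta = b/b = 1.
        return 1.0
    return b / (d_star / (s + 1) + (1 + 1 / (s + 1)) * b)


if __name__ == "__main__":
    # Hypothetical settings: smoothness beta, depth parameter q,
    # intrinsic dimension d_star, and noise exponent s.
    for s in (0, 1, 5, math.inf):
        theta = rate_exponent(beta=1.0, q=2, d_star=2, s=s)
        print(f"s={s}: theta={theta:.4f}")
```

Under these assumed values the exponent grows with the noise exponent $s$ and reaches $1$ at $s = \infty$, that is, a rate of order $1/n$. For $\beta < 1$ the factor $(1\wedge\beta)^q$ shrinks the exponent as the composition depth $q$ grows.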

Paper Structure

This paper contains 13 sections, 10 theorems, 151 equations, 1 figure.

Key Result

Theorem 1

Let $n\in\mathbb{N}$ and $d\in\mathbb{N}$. Consider an i.i.d. sample $\{(X_i,Y_i)\}_{i=1}^n$ drawn from a distribution $P$ on $[0,1]^d\times\{-1,1\}$ and a nonempty class $\mathcal{F}$ of Borel measurable functions from $[0,1]^d$ to $\mathbb{R}$. Let $F\in(0,\infty)$, $J\in(0,\infty)$ and $\phi:\mathbb{R}$ ... Suppose that there exist a measurable function $\psi:[0,1]^d\times\{-1,1\}\to\mathbb{R}$ and consta...

Theorems & Definitions (15)

  • Theorem 1
  • Theorem 2
  • Theorem 3
  • Theorem 4
  • Theorem 5
  • Theorem 6
  • Lemma B.1
  • proof
  • proof (of a theorem)
  • Lemma B.2
  • ...and 5 more