Universal Consistency of Wide and Deep ReLU Neural Networks and Minimax Optimal Convergence Rates for Kolmogorov-Donoho Optimal Function Classes

Hyunouk Ko, Xiaoming Huo

TL;DR

This work proves that wide and deep ReLU neural networks trained with the logistic loss achieve universal strong consistency across all distributions for binary classification, and it derives minimax-optimal convergence rates for neural-network classifiers under Kolmogorov-Donoho function classes with finite exponents. The authors move beyond traditional smoothness assumptions by leveraging Kolmogorov-Donoho approximation theory and a sieve construction over networks, showing that interpolating classifiers with benign overfitting can attain optimal rates under Tsybakov noise and regularity on the input distribution. They present two main rate theorems: one for standard feedforward nets and another for skip-connected nets with mixed activations, both yielding minimax rates up to polylog factors for function classes such as Hölder and Besov. The results provide a unified framework linking universal consistency, approximation power of neural networks, and information-theoretic function class exponents, with implications for understanding when deep nets can be statistically optimal in broad, nonparametric settings.
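For orientation, the notions named in this summary can be written in their standard textbook forms. The symbols below ($\eta$, $R$, $R^*$, $\widehat{f}_n$, $C$, $q$, $t_0$) are assumptions introduced here for illustration; the paper's own notation and exact conditions may differ.

$$
\eta(x) = \mathbb{P}(Y = 1 \mid X = x), \qquad
R(f) = \mathbb{P}\big(f(X) \neq Y\big), \qquad
R^* = \inf_{f \text{ measurable}} R(f).
$$

A sequence of classifiers $\widehat{f}_n$ built from $n$ samples is usually called universally strongly consistent if, for every distribution of $(X, Y)$,

$$
R(\widehat{f}_n) \longrightarrow R^* \quad \text{almost surely as } n \to \infty.
$$

The Tsybakov noise condition with exponent $q \ge 0$ is commonly stated as

$$
\mathbb{P}\big(\lvert 2\eta(X) - 1 \rvert \le t\big) \le C\, t^{q} \quad \text{for all } t \in (0, t_0],
$$

so that a larger $q$ means less probability mass near the decision boundary $\eta = 1/2$.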

Abstract

In this paper, we prove the universal consistency of wide and deep ReLU neural network classifiers trained on the logistic loss. We also give sufficient conditions for a class of probability measures for which classifiers based on neural networks achieve minimax optimal rates of convergence. The result applies to a wide range of known function classes. In particular, while most previous works impose explicit smoothness assumptions on the regression function, our framework encompasses more general settings. The proposed neural networks are either the minimizers of the logistic loss or the $0$-$1$ loss. In the former case, they are interpolating classifiers that exhibit a benign overfitting behavior.
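The two empirical losses mentioned in the abstract can be sketched as follows. This is a minimal illustration, not the authors' construction: the function names (`relu_net`, `logistic_loss`, `zero_one_loss`), the label convention $y \in \{-1, +1\}$, and the tie-breaking rule at score zero are all assumptions made here.

```python
import numpy as np

def relu(z):
    # Elementwise ReLU activation.
    return np.maximum(z, 0.0)

def relu_net(x, weights, biases):
    """Forward pass of a fully connected ReLU network with scalar output.

    x has shape (n, d); weights[i] has shape (d_i, d_{i+1}); the last
    layer is linear (no activation) and maps to one output unit.
    """
    h = x
    for W, b in zip(weights[:-1], biases[:-1]):
        h = relu(h @ W + b)
    return (h @ weights[-1] + biases[-1]).ravel()

def logistic_loss(scores, y):
    """Empirical logistic loss, mean of log(1 + exp(-y * f(x))).

    Assumes labels y in {-1, +1}. logaddexp keeps the computation stable
    for large |scores|.
    """
    return float(np.mean(np.logaddexp(0.0, -y * scores)))

def zero_one_loss(scores, y):
    """Empirical 0-1 loss of the classifier sign(f(x)).

    Ties at score 0 are assigned to +1 (an arbitrary convention).
    """
    pred = np.where(scores >= 0, 1, -1)
    return float(np.mean(pred != y))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.normal(size=(8, 2))
    y = np.where(x[:, 0] > 0, 1, -1)
    # A hypothetical width-4, one-hidden-layer network with random weights.
    weights = [rng.normal(size=(2, 4)), rng.normal(size=(4, 1))]
    biases = [np.zeros(4), np.zeros(1)]
    scores = relu_net(x, weights, biases)
    print("logistic loss:", logistic_loss(scores, y))
    print("0-1 loss:", zero_one_loss(scores, y))
```

In this sketch, a network that interpolates the training sample is one whose `zero_one_loss` on that sample is 0. The logistic loss is the smooth surrogate that would be minimized in practice, since the 0-1 loss is piecewise constant in the weights.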

Paper Structure

This paper contains 19 sections, 4 theorems, 66 equations.

Key Result

Lemma 3.1

Let $(A,\mathcal{A})$ be a measurable space and $B$ a compact, metrizable topological space. Assume $m(\cdot,\cdot): A \times B \rightarrow \mathbb{R}$ is measurable in the first argument and continuous in the second argument. Then, there exists a Borel measurable mapping $\widehat{f}: A \rightarrow \dots$

Theorems & Definitions (18)

  • Definition 2.1: Kolmogorov-Donoho optimal exponent
  • Definition 2.2: Effective best $M$-term approximation rate
  • Definition 2.3: Effective best $M$-weight approximation rate
  • Definition 2.4
  • Lemma 3.1
  • Theorem 3.2
  • Remark 3.3
  • Remark 4.1
  • Remark 4.2
  • Theorem 4.3
  • ...and 8 more