Classification with Deep Neural Networks and Logistic Loss
Zihan Zhang, Lei Shi, Ding-Xuan Zhou
TL;DR
The paper derives generalization bounds for binary classification with deep ReLU networks trained under the logistic loss, even though the target function $f^*_{\phi,P}$ is unbounded. It introduces an oracle-type inequality that uses a crafted bivariate function $\psi$ to bound the excess $\phi$-risk without assuming boundedness. This yields sharp rates when the conditional class probability $\eta$ is Hölder smooth, and dimension-free rates when $\eta$ has a compositional structure. In the Hölder setting, the excess logistic risk $\mathcal{E}_P^\phi(\hat{f}_n^{\mathbf{FNN}})$ converges at rate $O\left(\left(\frac{(\log n)^5}{n}\right)^{\beta/(\beta+d)}\right)$, where $\beta$ is the Hölder smoothness index of $\eta$ and $d$ is the input dimension. Corresponding misclassification rates follow via calibration, and minimax lower bounds confirm that the rates are optimal up to logarithmic factors. Together, these results deepen the theoretical understanding of DNN-based binary classification with the logistic loss and offer insight into why deep networks can solve high-dimensional problems effectively in practice.
Abstract
Deep neural networks (DNNs) trained with the logistic loss (i.e., the cross-entropy loss) have made impressive advancements in various binary classification tasks. However, generalization analysis for binary classification with DNNs and logistic loss remains scarce. The unboundedness of the target function for the logistic loss is the main obstacle to deriving satisfactory generalization bounds. In this paper, we aim to fill this gap by establishing a novel and elegant oracle-type inequality, which enables us to deal with the boundedness restriction of the target function, and using it to derive sharp convergence rates for fully connected ReLU DNN classifiers trained with logistic loss. In particular, we obtain optimal convergence rates (up to log factors) requiring only the Hölder smoothness of the conditional class probability $\eta$ of data. Moreover, we consider a compositional assumption that requires $\eta$ to be the composition of several vector-valued functions, each component function of which is either a maximum value function or a Hölder smooth function depending only on a small number of its input variables. Under this assumption, we derive optimal convergence rates (up to log factors) which are independent of the input dimension of data. This result explains why DNN classifiers can perform well in practical high-dimensional classification problems. Besides the novel oracle-type inequality, the sharp convergence rates given in our paper are also due to a tight error bound for approximating the natural logarithm function near zero (where it is unbounded) by ReLU DNNs. In addition, we justify our claims for the optimality of rates by proving corresponding minimax lower bounds. All these results are new in the literature and will deepen our theoretical understanding of classification with DNNs.
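To make the unboundedness obstacle concrete, the following is a minimal numerical sketch, not code from the paper. It assumes labels $Y \in \{-1, 1\}$ and the standard logistic loss $\phi(t) = \log(1 + e^{-t})$, for which the pointwise minimizer of the conditional $\phi$-risk is $\log\frac{\eta}{1-\eta}$. The function names (`logistic_loss`, `conditional_risk`, `target`) are hypothetical and chosen only for illustration.

```python
import math


def logistic_loss(t: float) -> float:
    """phi(t) = log(1 + exp(-t)), evaluated in a numerically stable way."""
    # exp(-t) overflows for very negative t, so branch on the sign of t.
    if t >= 0:
        return math.log1p(math.exp(-t))
    return -t + math.log1p(math.exp(t))


def conditional_risk(eta: float, t: float) -> float:
    """Conditional phi-risk at a point where P(Y = 1 | x) = eta and the prediction is t."""
    return eta * logistic_loss(t) + (1.0 - eta) * logistic_loss(-t)


def target(eta: float) -> float:
    """Pointwise minimizer log(eta / (1 - eta)) of the conditional phi-risk, for 0 < eta < 1."""
    return math.log(eta / (1.0 - eta))


# The target value grows without bound as eta approaches 1 (and symmetrically near 0).
for eta in (0.5, 0.9, 0.999, 1 - 1e-9):
    print(f"eta = {eta!r:>14}  target = {target(eta):.4f}")
```

The printed values grow as $\eta$ approaches 1, which is the behavior that rules out analyses requiring a bounded target function. The logarithm appearing in `target` is also the function whose behavior near zero the abstract says must be approximated by ReLU networks.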
