A Statistical Analysis for Supervised Deep Learning with Exponential Families for Intrinsically Low-dimensional Data
Saptarshi Chakraborty, Peter L. Bartlett
TL;DR
This paper analyzes finite-sample generalization for supervised learning where the conditional distribution of the response given the input lies in a regular exponential family with a $\beta$-Hölder mean function. By introducing the entropic dimension $\bar{d}_{2\beta}(\lambda)$ to quantify the intrinsic dimensionality of the input distribution $\lambda$, the authors derive rates that depend on this intrinsic measure rather than on the ambient input dimension $d$. When the explanatory variables have an upper-bounded density, they establish an upper bound of $\tilde{\mathcal{O}}\left(d^{\frac{2\lfloor\beta\rfloor(\beta+d)}{2\beta+d}} n^{-\frac{2\beta}{2\beta+d}}\right)$, so the dependence on $d$ is at most polynomial rather than exponential. When the density is lower-bounded, they show that the rate $n^{-\frac{2\beta}{2\beta+d}}$ is nearly minimax optimal in the sample size $n$. For intrinsically low-dimensional data, the rate becomes $\tilde{\mathcal{O}}\left(n^{-\frac{2\beta}{2\beta+\bar{d}_{2\beta}(\lambda)}}\right)$. The analysis uses ReLU networks, Bregman losses corresponding to the exponential-family form, and an oracle-inequality framework that separates approximation error from generalization error. These results generalize classical Gaussian-noise regression and improve on existing intrinsic-dimension bounds, which rely on the Minkowski or manifold dimension of the support. The findings help explain when deep supervised learners can achieve near-minimax-optimal rates on high-dimensional but structured data, and they clarify how density bounds and intrinsic dimensionality shape convergence.
Abstract
Recent advances have revealed that the rate of convergence of the expected test error in deep supervised learning decays as a function of the intrinsic dimension and not the dimension $d$ of the input space. Existing literature defines this intrinsic dimension as the Minkowski dimension or the manifold dimension of the support of the underlying probability measures, which often results in sub-optimal rates and unrealistic assumptions. In this paper, we consider supervised deep learning when the response given the explanatory variable is distributed according to an exponential family with a $\beta$-Hölder smooth mean function. We consider an entropic notion of the intrinsic data-dimension and demonstrate that with $n$ independent and identically distributed samples, the test error scales as $\tilde{\mathcal{O}}\left(n^{-\frac{2\beta}{2\beta+\bar{d}_{2\beta}(\lambda)}}\right)$, where $\bar{d}_{2\beta}(\lambda)$ is the $2\beta$-entropic dimension of $\lambda$, the distribution of the explanatory variables. This improves on the best-known rates. Furthermore, under the assumption of an upper-bounded density of the explanatory variables, we characterize the rate of convergence as $\tilde{\mathcal{O}}\left(d^{\frac{2\lfloor\beta\rfloor(\beta+d)}{2\beta+d}} n^{-\frac{2\beta}{2\beta+d}}\right)$, establishing that the dependence on $d$ is not exponential but at most polynomial. We also demonstrate that when the explanatory variable has a lower-bounded density, this rate, in terms of the number of data samples, is nearly optimal for learning the dependence structure for exponential families.
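
To get a feel for what replacing the ambient dimension $d$ by an intrinsic dimension does to the rate $n^{-\frac{2\beta}{2\beta+\dim}}$, the exponents can be evaluated numerically. The following is a minimal sketch, not code from the paper: the function names are hypothetical, the values $\beta = 2$, $d = 784$ and an intrinsic dimension of $10$ are illustrative assumptions, and constants and logarithmic factors hidden by $\tilde{\mathcal{O}}$ are ignored.

```python
import math


def rate_exponent(beta: float, dim: float) -> float:
    """Exponent a in the rate n^{-a}, with a = 2*beta / (2*beta + dim)."""
    return 2.0 * beta / (2.0 * beta + dim)


def dimension_prefactor_exponent(beta: float, d: int) -> float:
    """Exponent of d in the prefactor d^{2*floor(beta)*(beta+d)/(2*beta+d)}."""
    return 2.0 * math.floor(beta) * (beta + d) / (2.0 * beta + d)


def samples_needed(target: float, beta: float, dim: float) -> float:
    """Sample size n solving n^{-a} = target (constants and logs ignored)."""
    return target ** (-1.0 / rate_exponent(beta, dim))


if __name__ == "__main__":
    # Illustrative values only (assumptions, not taken from the paper).
    beta = 2.0
    ambient_dim = 784      # e.g. a flattened 28x28 image
    intrinsic_dim = 10.0   # hypothetical stand-in for the entropic dimension

    print("exponent with ambient dimension:  ", rate_exponent(beta, ambient_dim))
    print("exponent with intrinsic dimension:", rate_exponent(beta, intrinsic_dim))
    print("power of d in the prefactor:      ",
          dimension_prefactor_exponent(beta, ambient_dim))
    print("n for error 0.1 (intrinsic):      ",
          samples_needed(0.1, beta, intrinsic_dim))
    print("n for error 0.1 (ambient):        ",
          samples_needed(0.1, beta, ambient_dim))
```

Since $\frac{\beta+d}{2\beta+d} < 1$, the power of $d$ in the prefactor is always below $2\lfloor\beta\rfloor$, which is the sense in which the dependence on $d$ is polynomial. The sample-size comparison, in turn, shows why a rate driven by a small intrinsic dimension is far more informative than one driven by $d$.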
