Deep Horseshoe Gaussian Processes
Ismaël Castillo, Thibault Randrianarisoa
TL;DR
This paper introduces the Deep Horseshoe Gaussian Process (Deep-HGP) prior, a simple Bayesian nonparametric construction that stacks Gaussian-process layers whose lengthscale parameters are drawn from half-Horseshoe priors. The construction adapts to the smoothness of the regression function and performs soft, data-driven variable selection across high-dimensional inputs. A key novelty is the freezing-of-paths mechanism: shrinking the scaling parameters (inverse lengthscales) of irrelevant coordinates toward zero makes sample paths nearly constant along those directions, which reduces the active dimensionality without hard model selection. The authors establish near-minimax posterior contraction rates that adapt to both smoothness and compositional structure, with explicit dimension dependence that allows the ambient dimension to grow with the sample size. The results cover both shallow and multilayer Deep-HGP priors and extend to standard posteriors via augmented priors. The outcome is a theoretically grounded family of Bayesian priors for high-dimensional regression with compositional structure, with implications for uncertainty quantification and model interpretation in deep Bayesian nonparametrics.
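The freezing-of-paths idea can be illustrated with a minimal single-layer sketch. Everything here is an assumption made for illustration and not the authors' implementation: the function names (`sample_half_horseshoe`, `se_kernel`, `sample_path`), the parametrization of the squared-exponential kernel by per-coordinate scalings `a` (inverse lengthscales), the global scale `tau`, and the hand-picked scalings `a_fixed`.

```python
import numpy as np

def sample_half_horseshoe(d, tau=1.0, rng=None):
    """Draw d nonnegative scalings a_j = tau * lambda_j * |z_j| (assumed form):
    lambda_j is half-Cauchy, z_j is standard normal."""
    rng = np.random.default_rng() if rng is None else rng
    lam = np.abs(rng.standard_cauchy(d))   # heavy tails: some scalings stay large
    z = np.abs(rng.standard_normal(d))     # mass near zero: many scalings shrink
    return tau * lam * z

def se_kernel(X, Y, a):
    """Squared-exponential kernel exp(-sum_j a_j^2 (x_j - y_j)^2)."""
    diff = (X[:, None, :] - Y[None, :, :]) * a
    return np.exp(-np.sum(diff ** 2, axis=-1))

def sample_path(X, a, rng):
    """Sample a centered GP at the rows of X via an eigendecomposition,
    which tolerates the near-singular covariance of a frozen coordinate."""
    w, V = np.linalg.eigh(se_kernel(X, X, a))
    w = np.clip(w, 0.0, None)              # remove tiny negative round-off
    return V @ (np.sqrt(w) * rng.standard_normal(len(w)))

rng = np.random.default_rng(0)

# Hand-picked scalings: coordinate 1 is active, coordinate 2 is nearly frozen.
a_fixed = np.array([2.0, 1e-3])
x0 = np.array([[0.0, 0.0]])
corr_active = se_kernel(x0, np.array([[1.0, 0.0]]), a_fixed)[0, 0]  # exp(-4)
corr_frozen = se_kernel(x0, np.array([[0.0, 1.0]]), a_fixed)[0, 0]  # close to 1

# Consecutive grid points (rows 2i, 2i+1) differ only in the frozen coordinate.
grid = np.array([[u, v] for u in (0.0, 0.5, 1.0) for v in (0.0, 1.0)])
path = sample_path(grid, a_fixed, rng)
frozen_gap = np.max(np.abs(path[0::2] - path[1::2]))

# Random scalings from the assumed half-Horseshoe form.
scalings = sample_half_horseshoe(5, rng=rng)
```

With a scaling close to zero, the kernel correlation between two points that differ only in that coordinate is close to one, so the sampled path barely moves along it (`frozen_gap` is small), while the path still varies along the active coordinate.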
Abstract
Deep Gaussian processes have recently been proposed as natural objects to fit, similarly to deep neural networks, possibly complex features present in modern data samples, such as compositional structures. Adopting a Bayesian nonparametric approach, it is natural to use deep Gaussian processes as prior distributions, and use the corresponding posterior distributions for statistical inference. We introduce the deep Horseshoe Gaussian process (Deep-HGP), a new simple prior based on deep Gaussian processes with a squared-exponential kernel, that in particular enables data-driven choices of the key lengthscale parameters. For nonparametric regression with random design, we show that the associated posterior distribution recovers the unknown true regression curve optimally in terms of quadratic loss, up to a logarithmic factor, in an adaptive way. The convergence rates are simultaneously adaptive to both the smoothness of the regression function and to its structure in terms of compositions. The dependence of the rates on the dimension is explicit, allowing in particular for input spaces of dimension increasing with the number of observations.
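For orientation, the regression setting can be written out as below. This is a sketch of the standard random-design model and not a statement of the paper's theorems: the Gaussian noise, the symbols $f_0$, $\sigma$, $\beta$, $d$, and the choice of the classical rate for a $\beta$-smooth function of $d$ variables as benchmark are assumptions made for illustration.

```latex
% Random-design regression (noise distribution assumed Gaussian here)
Y_i = f_0(X_i) + \varepsilon_i, \qquad i = 1, \dots, n,
\qquad X_i \ \text{i.i.d.}, \quad \varepsilon_i \sim \mathcal{N}(0, \sigma^2).

% Classical minimax benchmark in quadratic loss for a \beta-smooth f_0
% defined on a d-dimensional domain
\varepsilon_n \asymp n^{-\beta/(2\beta + d)}.
```

When $f_0$ is a composition of functions that each depend on only a few variables, the dimension entering the exponent can be much smaller than $d$, which is the sense in which a rate can adapt to compositional structure as well as to smoothness.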
