Support Collapse of Deep Gaussian Processes with Polynomial Kernels for a Wide Regime of Hyperparameters

Daryna Chernobrovkina, Steffen Grünewälder

TL;DR

This work analyzes the priors induced by deep Gaussian processes (DGPs) with polynomial kernels, revealing depth-dependent averaging that makes the priors highly sensitive to layer hyperparameters. Using a Berry-Esseen framework, it derives a uniform approximation of a DGP by the simple form $S e^Y (g_1(x))^{c_1}$ and establishes explicit bounds based on log-moments, yielding a threshold near $\sigma \approx 1.88$ that governs whether increasing depth drives the prior mass toward zero or toward large norms. The results extend from linear to polynomial kernels, providing concrete Berry-Esseen bounds for both identically and non-identically distributed layer outputs, and illustrating with quadratic-kernel examples how the interaction of factors determines the prior's behavior. The findings help reconcile observed pathologies with the practical performance of DGPs and suggest principled directions for hyperparameter tuning and kernel design, while outlining open questions for extending the theory to broader kernel classes. Overall, the paper offers a quantitative lens on why DGP priors can collapse or explode and how this depends on depth and kernel structure, connecting to related insights on convolutional DGPs in the literature.
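One way to see where a threshold of this size can come from is the simplest linear-kernel setting. The sketch below assumes (this is an illustration, not the paper's derivation) that every layer multiplies its input by $\sigma \xi_i$ with $\xi_i \sim N(0,1)$, so the log-magnitude of the composed slope is a sum of i.i.d. terms with mean $\log \sigma + E[\log|\xi|]$. Using the standard identity $E[\log|\xi|] = -(\gamma + \log 2)/2$, where $\gamma$ is the Euler-Mascheroni constant, the drift changes sign at $\sigma^* = \exp((\gamma + \log 2)/2)$, which is consistent with the value near 1.88 quoted above. The names `mean_log_abs_normal` and `sigma_star` are hypothetical.

```python
import math

# Euler-Mascheroni constant (the math module does not provide it).
EULER_GAMMA = 0.5772156649015329


def mean_log_abs_normal(sigma: float) -> float:
    """E[log|sigma * Z|] for Z ~ N(0, 1), i.e. log(sigma) - (gamma + log 2) / 2."""
    return math.log(sigma) - 0.5 * (EULER_GAMMA + math.log(2.0))


# Assumption: identical linear layers, each contributing a factor sigma * Z_i.
# The per-layer drift of the log-magnitude vanishes at this value of sigma.
sigma_star = math.exp(0.5 * (EULER_GAMMA + math.log(2.0)))

print(f"critical sigma: {sigma_star:.4f}")
for sigma in (1.0, 2.5):
    drift = mean_log_abs_normal(sigma)
    direction = "toward zero" if drift < 0 else "toward large norms"
    print(f"sigma = {sigma}: per-layer log-drift {drift:+.3f} ({direction})")
```

Under this assumption a negative drift means the log-magnitude decreases roughly linearly with depth, and a positive drift means it grows; the two values $\sigma = 1$ and $\sigma = 2.5$ used in the figures fall on opposite sides of the critical value.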

Abstract

We analyze the prior that a Deep Gaussian Process with polynomial kernels induces. We observe that, even for relatively small depths, averaging effects occur within such a Deep Gaussian Process and that the prior can be analyzed and approximated effectively by means of the Berry-Esseen Theorem. One of the key findings of this analysis is that, in the absence of careful hyper-parameter tuning, the prior of a Deep Gaussian Process either collapses rapidly towards zero as the depth increases or places negligible mass on low norm functions. This aligns well with experimental findings and mirrors known results for convolution based Deep Gaussian Processes.
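The averaging effect described in the abstract can be illustrated numerically in the linear-kernel case. The sketch below is not the paper's construction or its error bound: it assumes each layer contributes a factor $\sigma \xi_i$ with $\xi_i \sim N(0,1)$, replaces the sum of the $\log|\sigma \xi_i|$ by a Gaussian with matching mean and variance (using the standard facts $E[\log|\xi|] = -(\gamma + \log 2)/2$ and $\mathrm{Var}[\log|\xi|] = \pi^2/8$), and compares the resulting log-normal tail probability with a Monte Carlo estimate. The function names and the threshold `t = 0.5` are illustrative choices.

```python
import math

import numpy as np

EULER_GAMMA = 0.5772156649015329
MEAN_LOG_ABS_Z = -0.5 * (EULER_GAMMA + math.log(2.0))  # E[log|Z|], Z ~ N(0, 1)
VAR_LOG_ABS_Z = math.pi ** 2 / 8.0                     # Var[log|Z|]


def lognormal_tail(ell: int, sigma: float, t: float = 0.5) -> float:
    """Gaussian (CLT) approximation of P(|prod_{i<=ell} sigma * Z_i| > t)."""
    mean = ell * (math.log(sigma) + MEAN_LOG_ABS_Z)
    std = math.sqrt(ell * VAR_LOG_ABS_Z)
    z = (math.log(t) - mean) / std
    return 0.5 * math.erfc(z / math.sqrt(2.0))  # equals 1 - Phi(z)


def monte_carlo_tail(ell: int, sigma: float, t: float = 0.5,
                     n: int = 100_000, seed: int = 0) -> float:
    """Monte Carlo estimate of the same probability, computed on the log scale."""
    rng = np.random.default_rng(seed)
    factors = sigma * rng.standard_normal((n, ell))
    log_abs_product = np.log(np.abs(factors)).sum(axis=1)
    return float(np.mean(log_abs_product > math.log(t)))


for ell in (1, 10, 30):
    approx = lognormal_tail(ell, sigma=1.0)
    estimate = monte_carlo_tail(ell, sigma=1.0)
    print(f"ell = {ell:2d}: log-normal approx {approx:.4f}, Monte Carlo {estimate:.4f}")
```

Working on the log scale avoids underflow in the product for larger depths. A Berry-Esseen-type bound would control the gap between the two columns at a rate of order $1/\sqrt{\ell}$; the sketch only displays the gap and does not compute such a bound.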

Paper Structure

This paper contains 23 sections, 1 theorem, 115 equations, 4 figures.

Key Result

Theorem 1

Given a DGP $g_\ell \circ \ldots \circ g_1$ on $\mathbb{R}$ with $\ell$ layers and corresponding independent GPs $g_1, \ldots, g_\ell$ with covariance functions $k_1(x,y) = (xy + c)^{d_1}$, $c \geq 0$, and $k_i(x,y) = \sigma_i^2 (xy)^{d_i}$, where $\sigma_i > 0$ for $2 \leq i \leq \ell$, and $d_1, \ldots, d_\ell$ [...] where $c_\ell = 1$ and $c_i = \sum_{j=i+1}^\ell d_j$ for $1 \leq i \leq \ell-1$. The random sign [...]

Figures (4)

  • Figure 1: (a) The densities of the product of $\ell = 1, 10, 30$ normally distributed random variables with mean $\mu = 0$ and variance $\sigma^2 = 1$ are shown. (b) The probability of the product attaining values around zero for larger $\sigma$ is shown.
  • Figure 2: (a) The probabilities that the scaled product and the log-normal approximation attain values above $1/2$ are compared ($\sigma = 1$). The plot is complemented by an error bound. (b) The same quantities are compared on a logarithmic scale (the error bound is omitted).
  • Figure 3: (a) The distribution of the product of $\ell = 1, 10$ and $30$ log-normal random variables with $\sigma = 3$ is shown. (b) The probabilities that the product and the log-normal approximation attain values above $1/2$ are shown ($\sigma = 3$).
  • Figure 4: (a) Five draws from a DGP with $\ell = 30$ layers, a linear kernel, and $\sigma = 1$ are shown. Notice the scale of the $y$-axis. (b) As for (a) but with $\sigma = 2.5$.
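
The setting of Figure 4 can be imitated with a few lines of code. The sketch below is not the code behind the figure: it assumes that every layer, including the first, uses the linear kernel $k(x,y) = \sigma^2 xy$, so that a draw from each layer is $x \mapsto \sigma \xi_i x$ with $\xi_i \sim N(0,1)$ and the composed draw is a line through the origin whose slope is the product of the per-layer slopes. The function name `sample_linear_dgp`, the input grid, and the seed are hypothetical.

```python
import numpy as np


def sample_linear_dgp(x: np.ndarray, ell: int, sigma: float,
                      n_draws: int = 5, seed: int = 0) -> np.ndarray:
    """Return an (n_draws, len(x)) array of draws from an ell-layer linear-kernel DGP.

    Assumption: each layer is a GP with kernel sigma^2 * x * y, whose draws are
    the random lines x -> sigma * xi * x with xi ~ N(0, 1).
    """
    rng = np.random.default_rng(seed)
    # One Gaussian slope per draw and per layer.
    layer_slopes = sigma * rng.standard_normal((n_draws, ell))
    # Composing linear maps multiplies their slopes.
    total_slope = layer_slopes.prod(axis=1)
    return total_slope[:, None] * x[None, :]


x = np.linspace(-1.0, 1.0, 5)
for sigma in (1.0, 2.5):
    draws = sample_linear_dgp(x, ell=30, sigma=sigma)
    print(f"sigma = {sigma}: max |f(x)| per draw = {np.abs(draws).max(axis=1)}")
```

Because the same seed is reused, the two runs differ only by the deterministic factor $2.5^{30}$, which makes the difference in the scale of the draws easy to see.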

Theorems & Definitions (1)

  • Theorem 1