
Classification of Heavy-tailed Features in High Dimensions: a Superstatistical Approach

Urte Adomaityte, Gabriele Sicuro, Pierpaolo Vivo

TL;DR

This work characterises the learning of a mixture of two clouds of data points with generic centroids via empirical risk minimisation in the high-dimensional regime, under the assumptions of generic convex loss and convex regularisation, and analytically characterises the separability transition.

Abstract

We characterise the learning of a mixture of two clouds of data points with generic centroids via empirical risk minimisation in the high dimensional regime, under the assumptions of generic convex loss and convex regularisation. Each cloud of data points is obtained via a double-stochastic process, where the sample is obtained from a Gaussian distribution whose variance is itself a random parameter sampled from a scalar distribution $\varrho$. As a result, our analysis covers a large family of data distributions, including the case of power-law-tailed distributions with no covariance, and allows us to test recent "Gaussian universality" claims. We study the generalisation performance of the obtained estimator, we analyse the role of regularisation, and we analytically characterise the separability transition.
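As a minimal illustration of the double-stochastic construction described in the abstract, the sketch below draws one variance per sample from a scalar density and then a Gaussian vector with that variance. The inverse-gamma choice with shape `a` and scale `c` is an assumption made for illustration (its mean, `c/(a-1)` for `a>1`, matches the variance quoted in the Figure 5 caption). The function name, the unit-norm centroid and the symmetric `±mu` label convention are likewise hypothetical and not taken from the paper.

```python
import numpy as np

def sample_superstatistical_mixture(n, d, a, c, rho=0.5, mu=None, seed=0):
    """Draw n points in dimension d from two clouds centred at +mu and -mu.

    Each point has its own variance delta ~ inverse-gamma(shape=a, scale=c)
    (an assumed choice of the scalar variance density), and is Gaussian
    conditionally on delta.
    """
    rng = np.random.default_rng(seed)
    if mu is None:
        mu = np.ones(d) / np.sqrt(d)  # assumption: unit-norm centroid
    # Labels: +1 with probability rho, -1 otherwise.
    y = np.where(rng.random(n) < rho, 1.0, -1.0)
    # If g ~ Gamma(a, 1), then c / g ~ inverse-gamma(a, scale=c).
    delta = c / rng.gamma(shape=a, scale=1.0, size=n)
    # Conditionally on delta, each sample is an isotropic Gaussian around y * mu.
    z = rng.standard_normal((n, d))
    x = y[:, None] * mu[None, :] + np.sqrt(delta)[:, None] * z
    return x, y, delta
```

In this sketch, choosing `c = a - 1` with `a > 1` gives unit average variance, while `0 < a < 1` gives clouds with infinite variance; taking `a` large makes the variances concentrate, so the samples approach ordinary Gaussian clouds.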

Paper Structure

This paper contains 30 sections, 95 equations, and 8 figures.

Figures (8)

  • Figure 1: Test error $\epsilon_g$ (solid line, top), training error $\epsilon_t$ (center) and training loss $\epsilon_\ell$ (bottom) as predicted by Eq. \ref{eq:errori} in the balanced case $\rho=1/2$. The dataset distribution is parametrised as in Eq. \ref{eq:invgamma0}. The classification task is solved using a quadratic loss with ridge regularisation of strength $\lambda=10^{-5}$. In the top panel, the dashed line corresponds to the Bayes-optimal bound. Dots correspond to the average outcome of $50$ numerical experiments in dimension $d=10^3$. In our parametrisation, the population covariance is $\boldsymbol{\Sigma}=\boldsymbol{I}_d$ for all values of $a$; moreover, for $a\to+\infty$, the case of Gaussian clouds with the same centroids and covariance is recovered. For further details on the numerical solutions, see Appendix \ref{app:numerics}.
  • Figure 2: Test error $\epsilon_g$ (top), training error $\epsilon_t$ (center) and training loss $\epsilon_\ell$ (bottom) via logistic loss training on balanced clusters parametrised as in Eq. \ref{eq:invgamma0} ($\boldsymbol{\Sigma}=\boldsymbol{I}_d$). A ridge regularisation with $\lambda=10^{-4}$ is adopted. Dots correspond to the average over $20$ numerical experiments with $d=10^3$. The Gaussian limit is recovered for $a\to+\infty$. Further details on the numerical solutions can be found in Appendix \ref{app:numerics}.
  • Figure 3: Test error $\epsilon_g$ in the classification of two balanced clouds, via quadratic loss (left) and logistic loss (right). In both cases, ridge regularisation is adopted ($\lambda=10^{-4}$ for the quadratic loss, $\lambda=10^{-3}$ for the logistic loss). Each cloud is a superposition of a power-law distribution with infinite variance and a Gaussian with covariance $\boldsymbol{\Sigma}=\boldsymbol{I}_d$. The parameter $r$ allows us to contaminate the purely Gaussian case ($r=0$) with an infinite-variance contribution ($0<r\leq 1$) as in Eq. \ref{eq:invgamma} with $c=1$ and $a=1/2$ (left) or $a=3/4$ (right). Dots correspond to the average test error of $20$ numerical experiments in dimension $d=10^3$. Note that, at a given sample complexity, Gaussian clouds are associated with the lowest test error for both losses.
  • Figure 4: (Left) Test error for ridge-regularised quadratic loss for various regularisation strengths. The data points of each cloud in the training set are distributed as in Eq. \ref{eq:invgamma0}, with shape parameter $a=2$, for balanced clusters (top) and unbalanced clusters ($\rho=1/4$, bottom). Points are the results of $50$ numerical experiments, and the dashed lines are Bayes-optimal bounds. (Center) Test error for different regularisation strengths $\lambda$ for two balanced clusters with quadratic loss at sample complexity $\alpha=2$, using the data distribution in Eq. \ref{eq:invgamma0}. The optimal regularisation strength $\lambda^\star$, obtained by averaging $5$ runs for each $a$, is marked with a cross. (Right) Optimal regularisation strength $\lambda^\star$ at $\alpha=2$ for different values of $a\in[1.5,10^2]$ for both balanced and unbalanced clusters, obtained by averaging $5$ runs. Note that, for $\rho=1/2$, $\lambda^\star\to+\infty$ as $a\to+\infty$.
  • Figure 5: Separability threshold $\alpha^\star$ obtained by solving the equations in Eq. \ref{spgeneral} with logistic loss, ridge regularisation strength $\lambda=10^{-5}$ and $\rho=1/2$. The data points of each cloud are distributed around their mean $\boldsymbol{\mu}$ with a two-parameter distribution as in Eq. \ref{eq:invgamma_p}. (Left) Finite-covariance case $\boldsymbol{\Sigma}=\sigma^2\boldsymbol{I}_d$, with $\sigma^2=\frac{c}{a-1}$, as a function of $a$. Dashed lines are the threshold values of the Gaussian case derived in \cite{mignacco20a}. At large $a$ and large variance $\sigma^2$, Cover's transition $\alpha=2$ for balanced clusters is recovered. (Right) Infinite-variance case. The cluster distribution is obtained by fixing $0<a<1$ in Eq. \ref{eq:invgamma_p}.
  • ...and 3 more figures