Classifying Overlapping Gaussian Mixtures in High Dimensions: From Optimal Classifiers to Neural Nets

Khen Cohen; Noam Levi; Yaron Oz

TL;DR

The paper derives Bayes-optimal decision boundaries for binary classification under overlapping high-dimensional Gaussian mixtures, detailing population and empirical limits and clarifying how covariance eigenvalues and eigenvectors shape the boundaries. It then shows neural networks, including two-layer quadratic-activation models, can closely approximate these Bayes rules, with KKT convergence offering a theoretical lens on gradient dynamics. Through toy models and real-data experiments (FMNIST, CIFAR-10) with covariance-flip and spectral tests, the authors demonstrate that eigenvectors, more than eigenvalues, largely determine decision thresholds in high dimensions, providing a principled link between GMM theory and neural network behavior. The findings illuminate how neural networks distill probabilistic structure from complex distributions and offer practical intuition for when and why covariance geometry guides learned classifiers.
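
As a rough illustration of the Bayes-optimal rule described above, the sketch below classifies samples from two zero-mean Gaussians by the sign of their log-likelihood ratio. The power-law spectrum $\lambda_i = i^{-(1+\alpha)}$, the equal class priors, the small dimension, and all function names are assumptions made for this example; they are not taken from the paper.

```python
import numpy as np

def power_law_spectrum(d, alpha):
    """Eigenvalues i^-(1+alpha), i = 1..d (assumed form, for illustration only)."""
    i = np.arange(1, d + 1, dtype=float)
    return i ** -(1.0 + alpha)

def random_basis(d, rng):
    """Random orthogonal matrix from the QR decomposition of a Gaussian matrix."""
    q, _ = np.linalg.qr(rng.standard_normal((d, d)))
    return q

def sample(n, spectrum, basis, rng):
    """Draw n zero-mean Gaussian samples with covariance basis @ diag(spectrum) @ basis.T."""
    z = rng.standard_normal((n, len(spectrum)))
    return (z * np.sqrt(spectrum)) @ basis.T

def log_likelihood_ratio(x, cov_a, cov_b):
    """log p_A(x) - log p_B(x) for zero-mean Gaussians; positive values favor class A."""
    quad_matrix = np.linalg.inv(cov_b) - np.linalg.inv(cov_a)
    logdet_a = np.linalg.slogdet(cov_a)[1]
    logdet_b = np.linalg.slogdet(cov_b)[1]
    quad = np.einsum("ni,ij,nj->n", x, quad_matrix, x)
    return 0.5 * quad + 0.5 * (logdet_b - logdet_a)

rng = np.random.default_rng(0)
d, n = 20, 5000
spectrum = power_law_spectrum(d, alpha=0.5)

# Same spectrum for both classes, different random eigenbases.
basis_a, basis_b = random_basis(d, rng), random_basis(d, rng)
cov_a = basis_a @ np.diag(spectrum) @ basis_a.T
cov_b = basis_b @ np.diag(spectrum) @ basis_b.T

x_a = sample(n, spectrum, basis_a, rng)
x_b = sample(n, spectrum, basis_b, rng)

# With equal priors, the optimal rule thresholds the log-likelihood ratio at zero.
acc = 0.5 * (np.mean(log_likelihood_ratio(x_a, cov_a, cov_b) > 0)
             + np.mean(log_likelihood_ratio(x_b, cov_a, cov_b) < 0))
print(f"balanced accuracy of the likelihood-ratio rule: {acc:.3f}")
```

Although the two classes share a mean and a spectrum, the rule separates them using only the difference between their eigenbases.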

Abstract

We derive closed-form expressions for the Bayes optimal decision boundaries in binary classification of high dimensional overlapping Gaussian mixture model (GMM) data, and show how they depend on the eigenstructure of the class covariances, for particularly interesting structured data. We empirically demonstrate, through experiments on synthetic GMMs inspired by real-world data, that deep neural networks trained for classification learn predictors which approximate the derived optimal classifiers. We further extend our study to networks trained on authentic data, observing that decision thresholds correlate with the covariance eigenvectors rather than the eigenvalues, mirroring our GMM analysis. This provides theoretical insights regarding neural networks' ability to perform probabilistic inference and distill statistical patterns from intricate distributions.
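
One way to see why a network can match a quadratic decision rule is that a two-layer network with quadratic activation can represent any quadratic form exactly. The sketch below builds such a network from the eigendecomposition of an arbitrary symmetric matrix. It only illustrates expressivity: the weights are constructed by hand rather than trained, and the names `quadratic_net`, `hidden_weights`, and `readout` are hypothetical.

```python
import numpy as np

def quadratic_net(x, hidden_weights, readout, bias=0.0):
    """Two-layer net with quadratic activation: f(x) = sum_j a_j (w_j . x)^2 + b."""
    return ((x @ hidden_weights.T) ** 2) @ readout + bias

rng = np.random.default_rng(0)
d = 8

# Arbitrary symmetric matrix standing in for the matrix of a quadratic classifier.
g = rng.standard_normal((d, d))
m = 0.5 * (g + g.T)

# m = V diag(a) V^T, so x^T m x = sum_j a_j (v_j . x)^2.
eigvals, eigvecs = np.linalg.eigh(m)
hidden_weights = eigvecs.T   # one hidden unit per eigenvector
readout = eigvals            # readout weights are the eigenvalues

x = rng.standard_normal((5, d))
direct = np.einsum("ni,ij,nj->n", x, m, x)
net_out = quadratic_net(x, hidden_weights, readout)
max_err = np.max(np.abs(direct - net_out))
print(f"max deviation between quadratic form and network: {max_err:.2e}")
```

In this construction the hidden-layer weights carry the eigenvectors and the readout carries the eigenvalues, so a hidden width of $d$ suffices to reproduce the quadratic form.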

Paper Structure (21 sections, 1 theorem, 21 equations, 14 figures)

Introduction
Background and Related Work
Overlapping Gaussian Mixtures in High Dimensions
Optimal Classification on Population Data
Empirical Optimal Classification
Analysis for a Toy Model of Complex Data
Diagonal Correlated Covariances
Rotated Correlated Covariances
Neural Networks as Nearly Optimal Classifiers
Neural Network Classifier
Karush–Kuhn–Tucker (KKT) Convergence
Results Extending to Realistic Data and Networks
$\Delta \alpha$ Tests on GMMs
Flip Tests on GMMs Constructed from Real Data
Flip Tests on Real Data
...and 6 more sections

Key Result

Theorem B.1

Let $\Phi({\boldsymbol{\theta}};\cdot)$ be a homogeneous ReLU neural network. Consider minimizing the logistic loss over a binary classification dataset $\{(\mathbf{x}_a,y_a)\}_{a=1}^N$ using gradient flow. Assume that there exists a time $t_0$ such that ${\cal L}({\boldsymbol{\theta}}(t_0))<1$. [...] Moreover, ${\cal L}({\boldsymbol{\theta}}(t)) \to 0$ as $t \to \infty$.
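
For context, results of this type in the cited literature relate the direction in which gradient flow converges to a margin-maximization problem. A sketch of that problem and its KKT conditions in the notation of the theorem follows. It assumes labels $y_a \in \{\pm 1\}$; the multipliers $\lambda_a$ are introduced here for illustration, and for ReLU networks $\nabla_{\boldsymbol{\theta}}$ should be read as a Clarke subgradient.

$$
\min_{\boldsymbol{\theta}} \; \tfrac{1}{2}\|{\boldsymbol{\theta}}\|^2
\quad \text{subject to} \quad
y_a\,\Phi({\boldsymbol{\theta}};\mathbf{x}_a) \ge 1, \qquad a = 1,\dots,N .
$$

$$
{\boldsymbol{\theta}} = \sum_{a=1}^{N} \lambda_a\, y_a\, \nabla_{\boldsymbol{\theta}} \Phi({\boldsymbol{\theta}};\mathbf{x}_a),
\qquad
\lambda_a \ge 0,
\qquad
\lambda_a \left( y_a\,\Phi({\boldsymbol{\theta}};\mathbf{x}_a) - 1 \right) = 0 .
$$

The first condition is stationarity, the second dual feasibility, and the third complementary slackness: only samples that sit exactly on the margin can carry a nonzero $\lambda_a$.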

Figures (14)

Figure 1: Probability density function of $\beta(x)$, evaluated on GMMs with different class covariance matrices. Blue and red bins indicate samples drawn from the classes $A$ and $B$, respectively. Solid curves represent the numerical evaluation of Eq. \ref{eq:beta_def} as a generalized $\chi^2$ distribution. Dots indicate the values given by Eq. \ref{eq:mean_classes_general}, as well as by Eq. \ref{eq:BOC_spectral_diff}. Left: $\beta$ distribution for covariances with the same basis but different spectra. Center: $\beta$ distribution for covariances with the same spectrum but different random bases. Right: $\beta$ distribution comparison for covariances with both different spectra and different bases, shown in gray. Here, we take $\alpha_A = 0.5, \Delta \alpha = -0.3, d=100$.
Figure 2: Comparison between the BOC and a quadratic network trained on data with different covariance spectra. Left: accuracy, measured as the sigmoid function acting on the network/BOC output. Center: network/BOC output distribution. Blue indicates network results while red is the BOC prediction. Here, the network is trained with $d_h=100$ and $N=200k$ to $91.27\%$ training accuracy and reaches $91.1\%$ with $\alpha_A=0.2$ and $\alpha_B=0.3$. Right: Results for KKT convergence of a quadratic NN trained on an $\alpha=0.2$ correlated GMM, where the class covariances share the same spectrum but are given a different random basis. Red indicates the BOC prediction, blue is the quadratic network output before the softmax function, when trained with $d_h=100$, and green are the KKT predictions with $\lambda_a \propto 1/N$ and $d_h=100$. The data dimensions are similar to the main text, i.e., $d=100$, using $N=100k$ samples; the network reached $100\%$ training and $99.9\%$ test accuracy.
Figure 3: Two-Gaussian classification task in which the classes share the same eigenvectors but differ in their eigenvalue bulk: the first class has $\alpha=0.5$ and the second has $\alpha+\Delta \alpha$. The model is a fully connected (FC) network, averaged over 5 training runs; the solid line represents the mean accuracy and the shaded area represents 1-sigma error bars.
Figure 4: The classification flipping test on the FMNIST and CIFAR10 datasets between class 0 and other classes. Top row: training and tests on GMMs generated from the class covariance matrices. Middle row: training and tests on real images after whitening, rescaling and coloring. Bottom row: eigenvector threshold at the flipping point, for FMNIST and CIFAR10, using GMMs and real images, when either training with a NN or predicting with the BOC. Blue columns indicate the flipping point on GMM classification, red indicates the same test on real images, and green shows the predictions of the BOC given only by information on the covariance matrices, as given by Eq. \ref{eq:beta_def}. The BOC is only shown for CIFAR10, due to numerical instabilities.
Figure 5: Results for neural collapse convergence for training ResNet18 on two classes (0,7) sampled from Gaussian versions of CIFAR10 data. Left: setting $\mu_A=\mu_B=0$. Right: $\mu_A\neq \mu_B$, as given by the class means. The network attains 99.9% test accuracy in both cases. We see no substantial difference in convergence to neural collapse metrics, affirming our claims in the main text.
...and 9 more figures
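
The class-conditional means of a quadratic statistic such as $\beta(x)$ in Figure 1 can be checked numerically. The sketch below assumes that $\beta$ is the log-likelihood ratio of two zero-mean Gaussians, $\beta(x) = \tfrac{1}{2} x^\top M x + c$, whose mean under a class with covariance $\Sigma$ is $\tfrac{1}{2}\mathrm{tr}(M\Sigma) + c$. That definition, the power-law spectrum $i^{-(1+\alpha)}$, and the reduced dimension $d=10$ are assumptions for illustration; the paper's exact expressions are not reproduced here.

```python
import numpy as np

def beta_coefficients(cov_a, cov_b):
    """Return (M, c) such that beta(x) = 0.5 * x^T M x + c = log p_A(x) - log p_B(x)."""
    m = np.linalg.inv(cov_b) - np.linalg.inv(cov_a)
    c = 0.5 * (np.linalg.slogdet(cov_b)[1] - np.linalg.slogdet(cov_a)[1])
    return m, c

rng = np.random.default_rng(1)
d, n = 10, 100_000
i = np.arange(1, d + 1, dtype=float)

# Same (identity) basis, different spectra: alpha_A = 0.5, alpha_B = alpha_A + delta_alpha = 0.2.
cov_a = np.diag(i ** -1.5)
cov_b = np.diag(i ** -1.2)
m, c = beta_coefficients(cov_a, cov_b)

# Closed-form class means of beta.
mean_exact_a = 0.5 * np.trace(m @ cov_a) + c
mean_exact_b = 0.5 * np.trace(m @ cov_b) + c

# Monte Carlo estimate under class A.
x_a = rng.standard_normal((n, d)) * np.sqrt(np.diag(cov_a))
beta_a = 0.5 * np.einsum("ni,ij,nj->n", x_a, m, x_a) + c
mean_mc_a = beta_a.mean()

print(f"class A mean of beta: exact {mean_exact_a:.4f}, sampled {mean_mc_a:.4f}")
print(f"class B mean of beta: exact {mean_exact_b:.4f}")
```

Under this definition the class A mean equals the KL divergence from class A to class B and the class B mean is minus the reverse KL divergence, so the two means fall on opposite sides of zero.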

Theorems & Definitions (1)

Theorem B.1: Paraphrased from [lyu2020gradient, ji2020directional]
