Untangling Gaussian Mixtures

Eva Fluck, Sandra Kiefer, Christoph Standke

TL;DR

This paper develops a quantitative theory of tangles in data sets drawn from Gaussian mixtures and provides explicit conditions under which tangles associated with the marginal Gaussian distributions exist asymptotically almost surely.

Abstract

Tangles were originally introduced as a concept to formalize regions of high connectivity in graphs. In recent years, they have also been discovered as a link between structural graph theory and data science: when interpreting similarity in data sets as connectivity between points, finding clusters in the data essentially amounts to finding tangles in the underlying graphs. This paper further explores the potential of tangles in data sets as a means for a formal study of clusters. Real-world data often follow a normal distribution. Accounting for this, we develop a quantitative theory of tangles in data sets drawn from Gaussian mixtures. To this end, we equip the data with a graph structure that models similarity between the points and allows us to apply tangle theory to the data. We provide explicit conditions under which tangles associated with the marginal Gaussian distributions exist asymptotically almost surely. This can be considered as a sufficient formal criterion for the separability of clusters in the data.

Paper Structure

This paper contains 10 sections, 13 theorems, 49 equations, 5 figures.

Key Result

Lemma 4

Let $G=(V,E,w)$ be a weighted graph. Let $W \subseteq V$ such that $\lvert W\rvert \geq 2$ and $G[W]$ is a clique. For $w_{W} \coloneqq \min\{w(u,v)\mid\{u,v\}\in E(W)\}$, it holds that $\mathcal{T}_{G}(W)$ is a $\kappa_G$-tangle of order $\frac{2}{9} \cdot \lvert W\rvert^2 \cdot w_{W}$.
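
As a quick numerical illustration of the order bound in Lemma 4, the following sketch computes $w_W$ and $\frac{2}{9} \cdot \lvert W\rvert^2 \cdot w_W$ for a clique given as a dictionary of edge weights. The function name `clique_tangle_order` and the edge-dictionary representation are assumptions made for this example, not taken from the paper. The code only evaluates the formula; it does not construct the tangle $\mathcal{T}_{G}(W)$ itself.

```python
from itertools import combinations


def clique_tangle_order(W, weight):
    """Evaluate (2/9) * |W|^2 * w_W, the order stated in Lemma 4.

    W      : collection of vertices assumed to induce a clique
    weight : dict mapping frozenset({u, v}) to the edge weight w(u, v)
             (hypothetical representation chosen for this sketch)
    """
    W = list(W)
    if len(W) < 2:
        raise ValueError("Lemma 4 requires |W| >= 2")
    # Every pair in W must be an edge, otherwise G[W] is not a clique.
    pairs = [frozenset(p) for p in combinations(W, 2)]
    if any(p not in weight for p in pairs):
        raise ValueError("G[W] is not a clique")
    w_W = min(weight[p] for p in pairs)  # minimum edge weight inside W
    return 2 / 9 * len(W) ** 2 * w_W


# Hypothetical example: a triangle with edge weights 1.0, 0.5 and 0.75.
weights = {
    frozenset({"a", "b"}): 1.0,
    frozenset({"b", "c"}): 0.5,
    frozenset({"a", "c"}): 0.75,
}
order = clique_tangle_order({"a", "b", "c"}, weights)
```

For this triangle, $\lvert W\rvert = 3$ and $w_W = 0.5$, so the order evaluates to $\frac{2}{9} \cdot 9 \cdot 0.5 = 1$, up to floating-point rounding.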

Figures (5)

  • Figure 1: Data drawn from two Gaussians with their hidden labels in red and blue. The dashed line represents a possible low-order cut and the purple circles are candidates for highly connected regions. (a) A $\delta$-neighborhood graph on the data, that is, two data points are adjacent if their distance is at most $\delta$. (b) The fully connected graph, with edge weights that depend on the distance between the data points and are shown as opacity.
  • Figure 2: Our computational results in the one-dimensional case. We study the applicability of \ref{thm:small_n_delta}, where we choose $\delta$ to optimize the probability bound. In the one-dimensional case, we assume that the means of the marginal distributions have distance $\lambda$ and the standard deviations are chosen to be $1$ and $\alpha$. From the first marginal distribution we draw $rn$ many data points, from the second one, we draw $(1-r)n$. (a) shows the probability bound dependent on $\lambda$ for different $n$ in the base case. In (b), we vary the mixing parameter $r$ and plot the probability bound dependent on $\lambda$ with $n=900$. (c), (d) show plots of the smallest mean distance $\lambda$ such that the conditions of \ref{thm:limit_delta_1dim} are met. In (c) we vary the mixing parameter $r$ and in (d), we vary the ratio $\alpha$ between the standard deviations. Fixing $r=1/2$ and $\alpha=1$, we obtain the bounds $\lambda>2.948$ and $\lambda>3.397$, respectively. Note that at some point the bound corresponding to Hoeffding's Inequality (\ref{cor:hoeffding}) becomes larger than the bounds relating to \ref{thm:berry-esseen}.
  • Figure 3: A schematic image of the higher-dimensional data distribution models. We take marginals with equal $\sigma$ and draw equally many points from each distribution. The red circles represent the means and the associated hyperballs. In blue, we see the low-order cut, in light blue the area of points that possibly contribute to the order of the cut. In (a) we have two marginal distributions. The low-order cut is $S=\{(x_1,\ldots,x_d)\mid x_1\leq \frac{\lambda}{2}\}$. The approximation of the hyperball as a hypercube is shown in dark red. (b) and (c) show 3 distributions whose means are positioned on an equilateral triangle. The cut along the Voronoi cell is shown in (b), in (c) we see the cut along a cube centered at one of the means.
  • Figure 4: Our computations in higher dimensions. First we consider a mixture of two types of distributions with mean distance $\lambda$ (see \ref{fig:2dim_2distr_datadistr}). We plot the probability bounds from \ref{thm:small_n_delta} dependent on $\lambda$, where the dimension is two in (a) and three in (b). In (c) we plot the smallest $\lambda$ such that the conditions of \ref{thm:large_n_delta} are met and we get incomparable tangles a.a.s. dependent on the dimension, using the described approximation. In (d) we take three distributions whose means form an equilateral triangle with side length $\lambda$ (see \ref{fig:2dim_3distr_datadistr_square}). We plot the probability bound from \ref{thm:small_n_delta} for dimension 2 against the side length $\lambda$. The size of the hypercube is chosen to maximize the resulting lower bound on the probability. Again we see that at some point the bound corresponding to Hoeffding's Inequality (\ref{cor:hoeffding}) becomes larger than the bounds relating to \ref{thm:berry-esseen}.
  • Figure 5: Computational results for the fully connected graph. (a) shows the smallest distance of means $\lambda$ such that \ref{thm:large_n_weight} can be applied, dependent on the width $\Delta$ of the interval used to define the tangles. (b) shows the results of applying \ref{thm:small_n_weight} for different sizes $n$ of the data set. The interval width $\Delta$ is chosen to maximize the probability bound.
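
The set-up behind Figures 1 and 2 can be sketched in a few lines of Python: sample a one-dimensional mixture with $rn$ points from $N(0,1)$ and $(1-r)n$ points from $N(\lambda,\alpha^2)$, build the $\delta$-neighborhood graph, and count the edges that cross a threshold cut. All function names, the choice to place the first mean at $0$, the threshold $\lambda/2$, and the use of the crossing-edge count as a stand-in for the order of a cut are assumptions made for this illustration; the paper's own definitions may differ.

```python
import random


def sample_mixture(n, r, lam, alpha, seed=0):
    """Draw round(r*n) points from N(0, 1) and the rest from N(lam, alpha^2)."""
    rng = random.Random(seed)
    n_first = round(r * n)
    first = [rng.gauss(0.0, 1.0) for _ in range(n_first)]
    second = [rng.gauss(lam, alpha) for _ in range(n - n_first)]
    return first + second


def delta_neighborhood_edges(points, delta):
    """Index pairs (i, j), i < j, whose points are at distance at most delta."""
    n = len(points)
    return [(i, j) for i in range(n) for j in range(i + 1, n)
            if abs(points[i] - points[j]) <= delta]


def crossing_edges(points, edges, threshold):
    """Edges with exactly one endpoint at or below the threshold."""
    return [(i, j) for i, j in edges
            if (points[i] <= threshold) != (points[j] <= threshold)]


# Hypothetical parameters: n = 200, r = 1/2, lambda = 6, alpha = 1, delta = 0.5.
points = sample_mixture(n=200, r=0.5, lam=6.0, alpha=1.0)
edges = delta_neighborhood_edges(points, delta=0.5)
cut = crossing_edges(points, edges, threshold=3.0)  # cut at lambda / 2
```

Comparing `len(cut)` with `len(edges)` for different values of `lam` gives a rough feel for how the cut between the two clusters thins out as the means move apart.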

Theorems & Definitions (30)

  • Example 1: see grohe2016
  • Definition 2
  • Lemma 4
  • proof
  • Theorem 5: Bienaymé-Chebyshev's Inequality (bien53che67)
  • Theorem 6: Hoeffding's Inequality (hoeff63)
  • Theorem 7: Berry-Esseen's Inequality (ber41ess45)
  • Definition 8
  • Definition 9
  • Definition 11
  • ...and 20 more