Uncertainty-Aware PCA for Arbitrarily Distributed Data Modeled by Gaussian Mixture Models

Daniel Klötzl, Ozan Tastekin, David Hägele, Marina Evers, Daniel Weiskopf

TL;DR

The paper tackles visualizing uncertainty in high-dimensional data when the underlying distributions are non-Gaussian. It generalizes uncertainty-aware PCA (UAPCA) by projecting full PDFs rather than only first and second moments, modeling the data with Gaussian mixture models (GMMs) and introducing a weighted variant (wGMM-UAPCA). Core contributions include a closed-form projected GMM PDF, aggregated means and covariances for GMMs, a weighted projection framework, PDF contour visualization, and an evaluation on real and analytic datasets. In that evaluation, the projected GMMs match sample-based reference projections (estimated with KDE) more closely than traditional UAPCA does. The approach enables more faithful, interactive analysis of uncertain data in low-dimensional spaces and applies to domains that need detailed uncertainty visualization.

Abstract

Multidimensional data is often associated with uncertainties that are not well-described by normal distributions. In this work, we describe how such distributions can be projected to a low-dimensional space using uncertainty-aware principal component analysis (UAPCA). We propose to model multidimensional distributions using Gaussian mixture models (GMMs) and derive the projection from a general formulation that allows projecting arbitrary probability density functions. The low-dimensional projections of the densities exhibit more details about the distributions and represent them more faithfully compared to UAPCA mappings. Further, we support including user-defined weights between the different distributions, which allows for varying the importance of the multidimensional distributions. We evaluate our approach by comparing the distributions in low-dimensional space obtained by our method and UAPCA to those obtained by sample-based projections.
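The idea of projecting a GMM rather than a single Gaussian can be illustrated with a minimal NumPy sketch. It relies only on standard facts: a linear map of a Gaussian mixture is again a Gaussian mixture with transformed component means and covariances, and the aggregate covariance of a mixture follows from the law of total covariance. The function names (`gmm_moments`, `principal_axes`, `project_gmm`) are hypothetical and not taken from the paper, and the user-defined weighting between distributions is not modeled here.

```python
import numpy as np


def gmm_moments(weights, means, covs):
    """Aggregate mean and covariance of a GMM (law of total covariance)."""
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()                           # normalize mixture weights
    mu = np.asarray(means, dtype=float)       # shape (K, d)
    sigma = np.asarray(covs, dtype=float)     # shape (K, d, d)
    mean = w @ mu
    centered = mu - mean
    within = np.einsum("k,kij->ij", w, sigma)
    between = np.einsum("k,ki,kj->ij", w, centered, centered)
    return mean, within + between


def principal_axes(cov, q):
    """Top-q eigenvectors of a symmetric covariance matrix, as columns."""
    _, vecs = np.linalg.eigh(cov)             # eigenvalues in ascending order
    return vecs[:, ::-1][:, :q]


def project_gmm(weights, means, covs, proj):
    """Push a GMM through y = proj.T @ x, with proj of shape (d, q).

    Weights are unchanged; each component keeps its Gaussian form.
    """
    mu = np.asarray(means, dtype=float) @ proj
    sigma = np.einsum("ia,kij,jb->kab", proj, np.asarray(covs, dtype=float), proj)
    return np.asarray(weights, dtype=float), mu, sigma


# Hypothetical example: a bimodal 3D distribution projected to 1D.
weights = [0.5, 0.5]
means = [[-1.0, 0.0, 0.0], [1.0, 0.0, 0.0]]
covs = [np.eye(3), np.eye(3)]

_, cov = gmm_moments(weights, means, covs)
proj = principal_axes(cov, 1)
low_w, low_mu, low_sigma = project_gmm(weights, means, covs, proj)
```

In this example the projected mixture keeps both modes, whereas a single Gaussian fitted to the aggregate mean and covariance would collapse them into one.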

Paper Structure

This paper contains 27 sections, 24 equations, 5 figures, 1 table, 1 algorithm.

Figures (5)

  • Figure 1: Contour identification follows a three-step process: vectorize the flattened PDF values (left), sort them in descending order (middle), and compute the cumulative density function (cdf) by summation of the sorted values (right). For illustration, discrete PDF values are labeled alphabetically (A, B, ...). The dotted colored lines indicate contour levels at $\rho \in \{25\%, 50\%, 95\%\}$, highlighting grid cells whose cdf exceeds the corresponding thresholds.
  • Figure 2: Visualizations of the projections for different datasets and projection methods. Each dataset has two classes, and the projection of the original sample points is shown as a reference, together with contour lines representing the isolines of the PDF. For demonstration and comparison, we use the wGMM-UAPCA projection matrix to consistently project each dataset, i.e., the samples, multivariate normal distributions, and GMMs.
  • Figure 3: The projection of the Fashion-MNIST dataset with 10 classes. The first column shows all classes together for UAPCA (top) and wGMM-UAPCA (bottom), while the following columns show only individual classes to avoid overplotting.
  • Figure 4: The student dataset's trapezoidal (a, for student David and subject M2) and uniform (b, for student Jack and subject P2) distributions can be better approximated by GMMs than normal distributions. Thus, the KDE of the projected points (c) can be better approximated by the GMM projection (d) than by UAPCA (e).
  • Figure 5: Projections of the epileptic (left) and hatespeech (right) datasets under different class weighting schemes: equal (top), sample-based (middle), and shifted (bottom). In the shifted scheme, one class receives most of the weight in the projection to reveal (possibly) hidden structures.
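The three-step contour identification described in the Figure 1 caption (flatten, sort in descending order, accumulate) can be sketched as follows. This is one reading of the caption, not the authors' implementation: the function name `contour_thresholds` is hypothetical, and normalizing the accumulated values so that the grid mass sums to 1 is an assumption.

```python
import numpy as np


def contour_thresholds(pdf_grid, levels=(0.25, 0.50, 0.95)):
    """Map each mass level rho to the PDF value that bounds it.

    Cells whose PDF value is >= the returned threshold together hold
    at least a fraction rho of the total mass on the grid.
    """
    flat = np.asarray(pdf_grid, dtype=float).ravel()   # step 1: vectorize
    ordered = np.sort(flat)[::-1]                      # step 2: sort descending
    cdf = np.cumsum(ordered)                           # step 3: cumulative sum
    cdf = cdf / cdf[-1]                                # assumption: normalize to 1
    # First sorted cell at which the accumulated mass reaches each level.
    idx = np.searchsorted(cdf, levels, side="left")
    idx = np.minimum(idx, ordered.size - 1)
    return {rho: ordered[i] for rho, i in zip(levels, idx)}


# Hypothetical 2x2 grid of unnormalized PDF values.
grid = np.array([[4.0, 3.0], [2.0, 1.0]])
thresholds = contour_thresholds(grid)
inside_50 = grid >= thresholds[0.50]   # cells enclosed by the 50% contour
```

The returned thresholds can be passed as isoline levels to a contour-plotting routine, so that each line encloses approximately the requested share of the probability mass.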