Information loss from dimensionality reduction in 5D-Gaussian spectral data

A. Schelle; H. Lüling

TL;DR

The paper addresses information loss when projecting high-dimensional Gaussian spectral data to lower-dimensional representations used in AI analytics. It models spectral data with Gaussian statistics and Shannon entropy, employing Monte Carlo sampling to compare correlated (conditional) and uncorrelated entropies under dimensionality reduction. The main finding is that the relative information loss for correlated entropy remains below $1\%$ for moderate to large sample sizes, while uncorrelated entropy can exhibit higher losses; as sample size grows, the loss tends toward zero. The work supports the reliability of 2D reductions in spectral analytics for disease diagnostics and highlights the need for deeper analyses to explain incorrect predictions beyond projection effects.

Abstract

Understanding the loss of information in spectral analytics is a crucial first step towards finding root causes for failures and uncertainties when using spectral data in artificial intelligence models built from modern complex data science applications. Here, we show from an elementary Shannon entropy model analysis with quantum statistics of Gaussian-distributed spectral data that the relative loss of information from dimensionality reduction due to the projection of an initial five-dimensional dataset onto two-dimensional diagrams is less than one percent in the parameter range of small data sets with sample sizes on the order of a few hundred data samples. From our analysis, we also conclude that the density and expectation value of the entropy probability distribution increase with the sample number and sample size using artificial data models derived from random-sampling Monte Carlo simulation methods.
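To make the distinction between correlated (conditional) and uncorrelated entropies concrete, here is a minimal Python sketch. It is not the authors' model, whose entropy definition and photon occupation statistics are not reproduced in this summary. It assumes instead a plain zero-mean 5D Gaussian with a randomly generated covariance, uses the textbook differential entropy of a Gaussian, and estimates everything from the sample covariance of each Monte Carlo realization. All function names and parameters (`simulate`, `n_samples`, `n_realizations`, `seed`) are hypothetical.

```python
import numpy as np

LOG_2PIE = np.log(2.0 * np.pi * np.e)


def joint_entropy(cov):
    """Differential entropy (in nats) of a Gaussian with covariance `cov`."""
    d = cov.shape[0]
    _, logdet = np.linalg.slogdet(cov)
    return 0.5 * (d * LOG_2PIE + logdet)


def conditional_entropy_sum(cov):
    """Sum over k of h(X_k | X_1, ..., X_{k-1}) for a Gaussian."""
    total = 0.0
    for k in range(cov.shape[0]):
        var = cov[k, k]
        if k > 0:
            # Conditional variance of X_k given the earlier components
            # (Schur complement of the leading k x k block).
            cross = cov[k, :k]
            var = var - cross @ np.linalg.solve(cov[:k, :k], cross)
        total += 0.5 * (LOG_2PIE + np.log(var))
    return total


def marginal_entropy_sum(cov):
    """Sum over k of h(X_k), i.e. all correlations are ignored."""
    return float(np.sum(0.5 * (LOG_2PIE + np.log(np.diag(cov)))))


def simulate(n_samples=400, n_realizations=100, dim=5, seed=0):
    """Return an array with one row (joint, conditional sum, marginal sum)
    per Monte Carlo realization. The covariance is an assumed toy choice."""
    rng = np.random.default_rng(seed)
    mixing = rng.standard_normal((dim, dim))
    true_cov = mixing @ mixing.T + np.eye(dim)  # symmetric positive definite
    rows = []
    for _ in range(n_realizations):
        x = rng.multivariate_normal(np.zeros(dim), true_cov, size=n_samples)
        cov = np.cov(x, rowvar=False)  # sample covariance, shape (dim, dim)
        rows.append((joint_entropy(cov),
                     conditional_entropy_sum(cov),
                     marginal_entropy_sum(cov)))
    return np.array(rows)


if __name__ == "__main__":
    results = simulate()
    joint, conditional, marginal = results.mean(axis=0)
    print("mean joint entropy:        ", joint)
    print("mean sum of conditionals:  ", conditional)
    print("mean sum of marginals:     ", marginal)
    print("relative excess (marginal):", (marginal - joint) / joint)
```

In this toy setting the chain rule makes the sum of conditional entropies equal to the joint entropy, whereas the sum of marginal entropies can only match or exceed it. This illustrates qualitatively why ignoring correlations misstates the information content, but it does not reproduce the paper's figures or numerical values.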

Paper Structure (4 sections, 4 equations, 3 figures)

Introduction
Theory
Results and analytics
Proposal and Outlook

Figures (3)

Figure 1: (color online) $10^4$ realizations of the total Shannon entropy versus the sum of conditional entropies in an artificial setup of $N=400$ particles containing 5 frequency components with corresponding photon occupation numbers $N_k$. The mean of the distribution is around $27.6$ for the sum of conditional entropies and $5.5$ for the total entropy; the relative information loss per frequency component measured by the Shannon entropy, as defined in Eq. (\ref{entropy}), is thus less than one percent.
Figure 2: (color online) $10^5$ realizations of the total Shannon entropy versus the sum of conditional entropies in an artificial setup of $N=40$ particles containing 5 components with corresponding photon occupation numbers $N_k$. The structure of the entropy distribution becomes easier to recognize as the number of sampling steps increases.
Figure 3: (color online) $10^5$ realizations of the total Shannon entropy versus the sum of uncorrelated entropies in an artificial setup of $N=40$ particles containing 5 components with corresponding photon occupation numbers $N_k$. The relative loss of information is about five to ten times larger for the sum of uncorrelated Shannon entropies than for the sum of correlated, i.e. conditional, Shannon entropies.
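
As a rough check on the numbers quoted for Figure 1: the entropy equation the caption refers to is not reproduced in this summary, so the following rests on an assumption. It supposes that the loss per frequency component compares the sum of conditional entropies, divided by the 5 components, with the total entropy. Under that reading the quoted means give

$$\frac{27.6}{5} = 5.52, \qquad \frac{|5.52 - 5.5|}{5.5} \approx 0.0036 \approx 0.4\%,$$

which is in line with the stated bound of less than one percent. Both means are quoted only approximately, so this value is indicative at best.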
