The Infinite-Dimensional Nature of Spectroscopy and Why Models Succeed, Fail, and Mislead

Umberto Michelucci, Francesca Venturini

Abstract

Machine learning (ML) models have achieved strikingly high accuracies in spectroscopic classification tasks, often without clear evidence that these models rely on chemically meaningful features. Existing studies have linked these results to data preprocessing choices, noise sensitivity, and model complexity, but no unifying explanation has been available so far. In this work, we show that these phenomena arise naturally from the intrinsic high dimensionality of spectral data. Using a theoretical analysis grounded in the Feldman–Hájek theorem and the concentration of measure, we show that even infinitesimal distributional differences, caused by noise, normalisation, or instrumental artefacts, may become perfectly separable in high-dimensional spaces. Through a series of targeted experiments on synthetic and real fluorescence spectra, we illustrate how models can achieve near-perfect accuracy even when chemical distinctions are absent, and why feature-importance maps may highlight spectrally irrelevant regions. We provide a rigorous theoretical framework, confirm the effect experimentally, and conclude with practical recommendations for building and interpreting ML models in spectroscopy.

Figures (13)

  • Figure 1: Illustration of the concentration of measure for multivariate Gaussian distributions. Shown are the empirical distributions of $\|x\|_2$ for samples drawn from $\mathcal{N}(0,\,1.0^2 I_n)$ (light blue) and $\mathcal{N}(0,\,1.1^2 I_n)$ (yellow), for increasing dimensionalities $n=2$, $50$, $500$, and $5000$ (panels A–D). In low dimensions the two distributions overlap substantially, but as $n$ increases the probability mass concentrates sharply around the typical radius $\sigma\sqrt{n}$, and even small variance differences cause almost complete separation. This illustrates how, in high-dimensional spaces, the measures induced by Gaussian (and many non-Gaussian) distributions become effectively disjoint, providing an intuitive geometric basis for the Feldman--Hájek theorem and for the sensitivity of high-dimensional classifiers to minute statistical differences (a minimal simulation sketch follows this list).
  • Figure 2: Ten representative synthetic one-peak spectra per class used in the synthetic-spectra experiments. Each curve is a Lorentzian profile sampled on an $n=100$-point axis, with peak centre jittered as $c \sim \mathcal{N}(50,10^2)$. Class 0 (blue) and Class 1 (orange) differ only through the FWHM $\xi_1=7$ vs. $\xi_2=9$, illustrating that the two classes are visually difficult to distinguish despite being statistically separable in high dimension (a generation sketch follows this list).
  • Figure 3: Fluorescence spectra of Spanish olive oil samples classified as Extra Virgin (EVOO), Virgin (VOO), and Lampante (LOO). The region 380--420 nm contains the Rayleigh scattering peak from the excitation LED. The black line indicates the average spectrum for each class.
  • Figure 4: Results from experiment N1. Classification accuracy of QDA (regularisation parameter 0.4) as a function of the standard-deviation gap $\Delta\sigma$ between two white-noise classes with equal mean $\mu=1$ and baseline $\sigma_1=1$. Each curve corresponds to a different number of points per array ($n \in \{5, 10, 50, 500\}$); at each $\Delta\sigma$, $N$ arrays per class are generated and split 80/20 into train/test. The dashed line at $1.0$ marks perfect accuracy; all results are on the test set. Panel (A) was obtained with a Toeplitz covariance with $\rho=0.95$, panel (B) with a homogeneous covariance. Interestingly, adding correlations between neighbouring values slows the approach to perfect accuracy but does not prevent it (a reproduction sketch follows this list).
  • Figure 5: Results from experiment N2. Accuracy of the Bayes classifier for two Gaussian white-noise classes with common mean $\mu=10$ and covariance matrices $\sigma_1^2 I_n$ vs. $\sigma_2^2 I_n$. The decision rule thresholds the sufficient statistic $S=\sum_{j=1}^{n}(x_j-\mu)^2$ at the LDA threshold $T$ (a sketch of this classifier follows this list).
  • ...and 8 more figures
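
The concentration effect shown in Figure 1 can be reproduced in a few lines. The following is a minimal sketch, not the paper's own code: the sample count, the random seed, and the midpoint threshold on $\|x\|_2$ are illustrative choices. It shows how a naive threshold on the norm alone approaches perfect accuracy as $n$ grows.

```python
import numpy as np

rng = np.random.default_rng(0)

# For x ~ N(0, sigma^2 I_n), the norm ||x||_2 concentrates around sigma * sqrt(n).
for n in [2, 50, 500, 5000]:
    norms_a = np.linalg.norm(rng.normal(0.0, 1.0, size=(10_000, n)), axis=1)
    norms_b = np.linalg.norm(rng.normal(0.0, 1.1, size=(10_000, n)), axis=1)
    # Midpoint between the two typical radii, used as a crude decision threshold.
    t = 0.5 * (1.0 + 1.1) * np.sqrt(n)
    acc = 0.5 * ((norms_a < t).mean() + (norms_b >= t).mean())
    print(f"n = {n:5d}: accuracy of the norm threshold = {acc:.3f}")
```

In low dimension the two norm distributions overlap and the threshold is barely better than chance; by $n=5000$ the same 10% variance gap yields essentially perfect separation, exactly the geometry pictured in panels A–D.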
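The synthetic one-peak spectra of Figure 2 can be generated as sketched below. The unit-height Lorentzian convention and the absence of additive noise are assumptions, since the caption fixes only the axis length, the centre jitter, and the two FWHM values.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.arange(100)  # n = 100-point spectral axis

def one_peak_spectra(fwhm, n_samples):
    """Unit-height Lorentzian peaks with centre jittered as c ~ N(50, 10^2)."""
    c = rng.normal(50.0, 10.0, size=(n_samples, 1))
    return 1.0 / (1.0 + ((x - c) / (fwhm / 2.0)) ** 2)

class0 = one_peak_spectra(fwhm=7.0, n_samples=10)  # Class 0, xi_1 = 7
class1 = one_peak_spectra(fwhm=9.0, n_samples=10)  # Class 1, xi_2 = 9
```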
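A sketch of the experiment-N1 setup behind Figure 4, using scikit-learn's QDA with the stated regularisation parameter of 0.4. The number of arrays per class, the value of $\Delta\sigma$, and the choice $n=100$ are illustrative; the Toeplitz construction matches panel (A), while panel (B)'s homogeneous covariance is not reproduced here.

```python
import numpy as np
from scipy.linalg import toeplitz
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n, N = 100, 200                        # points per array, arrays per class
mu, sigma1, d_sigma = 1.0, 1.0, 0.05   # baseline sigma_1 = 1, illustrative gap
rho = 0.95                             # Toeplitz correlation, as in panel (A)

# AR(1)-style Toeplitz correlation matrix and its Cholesky factor.
L = np.linalg.cholesky(toeplitz(rho ** np.arange(n)))

z = rng.standard_normal((2 * N, n)) @ L.T   # correlated standard noise
X = mu + np.concatenate([sigma1 * z[:N], (sigma1 + d_sigma) * z[N:]])
y = np.repeat([0, 1], N)

# 80/20 train/test split, QDA with regularisation parameter 0.4.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
qda = QuadraticDiscriminantAnalysis(reg_param=0.4).fit(X_tr, y_tr)
print(f"test accuracy at n = {n}, d_sigma = {d_sigma}: {qda.score(X_te, y_te):.3f}")
```

Sweeping `d_sigma` and `n` as in the caption, and setting `rho` to smaller values, reproduces the qualitative finding that neighbour correlations slow but do not prevent the climb to perfect test accuracy.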
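For Figure 5, the Bayes decision on the sufficient statistic $S$ can be sketched as below. The caption's LDA-derived threshold $T$ is not spelled out there, so this sketch substitutes the equal-prior likelihood-ratio threshold on $S$; $\sigma_2$ and $n$ are illustrative values.

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma1, sigma2, n = 10.0, 1.0, 1.1, 500  # sigma2 and n chosen for illustration

def S(x):
    # Sufficient statistic for the equal-mean, different-variance pair:
    # S / sigma_k^2 ~ chi^2_n under class k.
    return np.sum((x - mu) ** 2, axis=1)

# Equal-prior likelihood-ratio threshold on S (stand-in for the paper's
# LDA-derived threshold T, which the caption does not spell out).
T = n * sigma1**2 * sigma2**2 * np.log(sigma2**2 / sigma1**2) / (sigma2**2 - sigma1**2)

x1 = rng.normal(mu, sigma1, size=(5000, n))
x2 = rng.normal(mu, sigma2, size=(5000, n))
acc = 0.5 * ((S(x1) <= T).mean() + (S(x2) > T).mean())
print(f"accuracy at n = {n}: {acc:.3f}")
```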