Implementation of the Emulator-based Component Analysis

Anton Vladyka; Eemeli A. Eronen; Johannes Niskanen

TL;DR

This work presents a PyTorch-powered implementation of emulator-based component analysis (ECA), a projection-pursuit framework for solving ill-posed nonlinear inverse problems using a fast forward emulator. ECA identifies an orthogonal basis in input space that maximizes the variance of the emulator-predicted targets, enabling reconstruction of approximate inverse solutions in a reduced subspace via transform, inverse, and reconstruct operations. The implementation provides a reusable Python class with tunable optimization settings, demonstrated on a synthetic forward map $y(\mathbf{x})$ and on NEXAFS spectroscopy data using LMBTR descriptors, achieving robust, interpretable dimension reduction (often with only the first one or two components capturing most variance). The work makes the code and data publicly available, enabling reproducibility and broad application to other ill-posed inverse problems with fast emulators.

Abstract

We present a PyTorch-powered implementation of the emulator-based component analysis used for ill-posed numerical non-linear inverse problems, where an approximate emulator for the forward problem is known. This emulator may be a numerical model, an interpolating function, or a fitting function such as a neural network. With the help of the emulator and a data set, the method seeks dimensionality reduction by projection in the variable space so that maximal variance of the target (response) values of the data is covered. The obtained basis set for projection in the variable space defines a subspace of the greatest response for the outcome of the forward problem. The method allows for the reconstruction of the coordinates in this subspace for an approximate solution to the inverse problem. We present an example of using the code provided as a Python class.
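
The idea of choosing a projection direction by how well the emulator reproduces the known targets can be sketched on a toy problem. The snippet below is a simplified illustration, not the provided Python class: the `emulator` function, the direction `w_true`, the R²-style `covered_variance` score, and the brute-force angle search are all assumptions made for this example, and the search replaces the gradient-based PyTorch optimization of the actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy forward map standing in for a trained emulator:
# the response depends on x only through the direction w_true.
w_true = np.array([0.6, 0.8])  # unit vector

def emulator(X):
    """Toy emulator: scalar response for each row of X."""
    return np.tanh(X @ w_true)

X = rng.normal(size=(500, 2))  # standardized input data
y = emulator(X)                # "known" target values

def covered_variance(v, X, y):
    """R^2-style score of emulator outputs for data projected onto v."""
    v = v / np.linalg.norm(v)
    X_proj = np.outer(X @ v, v)  # rank-1 projection of every row onto v
    residual = y - emulator(X_proj)
    return 1.0 - np.sum(residual**2) / np.sum((y - y.mean())**2)

# Brute-force search over unit vectors on the half circle (feasible in 2-D only).
angles = np.linspace(0.0, np.pi, 1801)
candidates = np.column_stack([np.cos(angles), np.sin(angles)])
scores = np.array([covered_variance(v, X, y) for v in candidates])
v1 = candidates[np.argmax(scores)]
```

In this toy setting the best-scoring direction `v1` lies close to `w_true` (up to the resolution of the angle grid), because projecting onto that direction leaves the emulator output unchanged. In higher dimensions an exhaustive search is not feasible, which is why an iterative optimizer is needed.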

Paper Structure (8 sections, 6 equations, 5 figures, 2 tables, 1 algorithm)

Introduction
Definition
Implementation
Discussion
Conclusions
Author Contributions
Acknowledgements
Data availability

Figures (5)

Figure 1: The principle of ECA illustrated for 2 ECA components $\{\mathbf{v}_1, \mathbf{v}_2\}$ in a 3-dimensional $\mathbf{X}$ space $\mathrm{span} \{\mathbf{x}_1, \mathbf{x}_2,\mathbf{x}_3\}$ (a). The ECA components (basis vectors) are selected by iterative optimization so that the emulated $\mathbf{y}_\mathrm{emu}$ for the projected data points matches the known $\mathbf{y}$ of the original data as closely as possible (b).
Figure 2: Schematic of ECA functionality. ECA uses an emulator that is trained on $(\mathbf{X}, \mathbf{Y})_\mathrm{train}$ and validated on $(\mathbf{X}, \mathbf{Y})_\mathrm{test}$. Fitting the ECA object on $(\mathbf{X}, \mathbf{Y})_\mathrm{test}$ (or $(\mathbf{X}, \mathbf{Y})_\mathrm{train}$) results in a basis set $\mathbf{V}$. The 'transform' and 'project' operations are linear transformations of the input vector $\mathbf{x}$. In contrast, 'inverse' and 'reconstruct' are implemented using optimization algorithms. These are not linear operations and may produce ambiguous results for complex (high-dimensional) systems.
Figure 3: Performance of the new implementation on the rudimentary example data. (a) Test set $\mathbf{x}=(x_1,x_2)$ for the two-dimensional $(d=2)$ case. The color of each point depicts the corresponding value $y(\mathbf{x})$ (Eq. (\ref{eq:rudimentary1})). (b) Projection scores $\mathbf{t}=(t_1,t_2)$ of the data on the two ECA vectors using the same color scheme. (c) Evaluation times for the ECA fit of $\mathbf{v}_1$ as a function of dimensionality $d$. We observed a linear dependence of the ECA fitting time on the number of data points for both implementations. (d) Fit success rate, measured as the fraction of fits with $|\mathbf{v}\cdot\mathbf{v}_1|>0.95$, as a function of $d$ over 25 independent trials.
Figure 4: Reconstructed $t_1$ scores versus the known (projected) ones for input dimensionalities $d=2$ and $d=512$ obtained using (a) the original and (b) the presented implementation. The diagonal grey line indicates a perfect match. The scatter plots show that both implementations work well in solving the inverse problem for $t_1$, as further indicated by the high $R^2$ scores between the reconstructed and known points.
Figure 5: The interatomic distance distribution part of the first component vector for the example of aqueous triglycine. The results shown here are in agreement with the original study in Eronen2024. A non-zero value means that the studied spectral regions are sensitive to changes in this structural feature.
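
The linear 'transform' and 'project' operations of Figure 2 and the success criterion of Figure 3(d) can be illustrated with a few lines of NumPy. This is a minimal sketch under stated assumptions: the function names `transform`, `project`, and `fit_succeeded` and the random orthonormal basis `V` are hypothetical and do not reproduce the interface of the provided class.

```python
import numpy as np

rng = np.random.default_rng(1)
d, k = 5, 2  # input dimensionality and number of components

# Hypothetical orthonormal basis V (rows are the components v_1..v_k),
# built here from the QR decomposition of a random matrix.
Q, _ = np.linalg.qr(rng.normal(size=(d, k)))
V = Q.T  # shape (k, d)

def transform(X, V):
    """Scores t of each row of X on the basis vectors: T = X V^T."""
    return X @ V.T

def project(X, V):
    """Projection of each row of X onto the subspace spanned by V."""
    return transform(X, V) @ V

def fit_succeeded(v, v_ref, threshold=0.95):
    """Sign-insensitive alignment test |v . v_ref| > threshold for unit vectors."""
    v = v / np.linalg.norm(v)
    v_ref = v_ref / np.linalg.norm(v_ref)
    return bool(abs(v @ v_ref) > threshold)

X = rng.normal(size=(10, d))
T = transform(X, V)       # scores, shape (10, k)
X_proj = project(X, V)    # projected points, shape (10, d)
```

Because `V` is orthonormal, projecting twice changes nothing and the projected points keep the same scores. The 'inverse' and 'reconstruct' operations have no such closed form, since they go through the nonlinear emulator and therefore rely on optimization.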
