Spectra-Scope: A toolkit for automated and interpretable characterization of material properties from spectral data

Amalya C. Johnson, Chris Fajardo, Leena Sansguiri, Weike Ye, Steven B. Torrisi

TL;DR

Spectra-Scope is an open-source AutoML framework that characterizes material properties from spectroscopy data using interpretable machine learning (ML) models. Its emphasis on interpretability helps rationalize the behavior of individual models and understand the physical processes behind spectral features.

Abstract

Spectroscopy is a central pillar of materials characterization, providing useful information on properties like structure, composition, or excited state dynamics of a system. However, many spectroscopic techniques present challenges in the development of interpretable, performant, and reliable supervised learning models due to the wide range of possible nonlinear correlations that can exist between the signal and the response variable (target) of interest. Here, we present Spectra-Scope, an open-source AutoML framework for automatic characterization of material properties from spectroscopy data using interpretable machine learning (ML) models. The software is available as a Python implementation and as a no-code web application. It comprises tools for data preprocessing, nonlinear feature extraction, machine learning model training, and feature downselection. Users can train different types of simple, interpretable ML models on a set of feature transformations quickly and with modest computational resources. In this work, we outline the methods of Spectra-Scope and demonstrate its effectiveness across diverse datasets, with applications to materials and agricultural spectroscopy data. We show that Spectra-Scope can reproduce the performance of comparable models in the literature, and highlight how our emphasis on interpretability can be used to rationalize the behavior of individual models and understand the physical processes behind spectral features.
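As one illustration of the feature-extraction step, the sketch below computes a cumulative distribution function (CDF) transform of a single spectrum; the CDF is one of the featurizations named in the Figure 1 caption. This is a minimal sketch and not Spectra-Scope's own code: the function name `cdf_featurize`, the requirement that intensities be non-negative, and the normalization by the total intensity are all assumptions made for illustration.

```python
from itertools import accumulate


def cdf_featurize(intensities):
    """Return the normalized running sum of a 1-D spectrum (hypothetical helper).

    Assumes non-negative intensities on an evenly spaced grid, so that the
    running sum is monotonically non-decreasing and ends at 1.0.
    """
    if not intensities:
        raise ValueError("spectrum must contain at least one point")
    if any(value < 0 for value in intensities):
        raise ValueError("this sketch assumes non-negative intensities")

    running = list(accumulate(intensities))  # cumulative sum along the grid
    total = running[-1]
    if total == 0:
        raise ValueError("spectrum has zero total intensity")

    # Divide by the total so every spectrum maps onto the range [0, 1].
    return [value / total for value in running]


# A single symmetric peak centered on the third grid point.
spectrum = [0, 1, 3, 1, 0]
cdf_features = cdf_featurize(spectrum)
```

Each entry of `cdf_features` can then serve as one input feature for a downstream model. Because the transform integrates along the spectral axis, a shift in peak position changes many CDF values at once instead of just a few raw intensities.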

Paper Structure (14 sections, 1 equation, 6 figures, 1 table)


Figures (6)

  • Figure 1: Outline of this paper and the Spectra-Scope pipeline. (a) Input data can come from any experimental or simulated 1-D array data source for inference on a scalar response variable. (b) Available featurizations of spectral data include the cumulative distribution function, Gaussian peak fitting, principal component analysis, polynomial peak fitting, and others as outlined in the methods. (c) Transformed spectra are used to train a machine learning algorithm. This paper focuses on the LCEN algorithm and random forests, but Spectra-Scope can also be used with user-built models. (d) Model training of LCEN and random forests includes feature selection, either in the form of LCEN feature downselection or random forest feature importances. (e) The algorithms are used to predict the input response variables. Feature selection helps with model interpretability and with investigating modality importance.
  • Figure 2: Front page of the Spectra-Scope application. Multiple data types can be input and visualized on the home page. The app can featurize data, visualize featurizations, train models using random forests or LCEN, and visualize the features that the model ranks as important or downselects.
  • Figure 3: Regressing mean nearest-neighbor distance from simulated XANES spectra and PDFs of Ti-oxide structures. (a) Summary of RMSE for regressing bond length using LCEN and random forests for XANES, PDF, XANES + PDF, and other transformations of the data. CDF: cumulative distribution function. NLTrans: nonlinear feature expansion as outlined in the main text and the supplementary information. Clustering: feature agglomeration clustering. The top three feature and model combinations are marked with green, blue, and red stars, respectively. Comparison of the important features identified when using (b) the CDF transformation of XANES spectra, (c) the first 10 principal components of XANES spectra, and polynomial transformations of (d) XANES, (e) PDF, and (f) XANES + PDF for regression with random forests. (b) CDF of all XANES spectra in blue. Vertical dashed lines: top three important features for prediction using the CDF. (c) First two principal components of the XANES spectra colored by bond length. Color bar: bond length. (d) All XANES spectra from the dataset (blue), average spectrum (black), and corresponding 10 most important features for prediction (characteristic polynomial images). The top three most important features are highlighted by the vertical dashed lines. (e) All PDFs from the dataset (red), average PDF (black), and positions of the top 10 most important features for prediction. The top three most important features are highlighted by the vertical dashed lines. (f) Average XANES spectrum (blue) and PDF (red) when using both datasets simultaneously for regression. Dashed lines show the top three most important features extracted from fitting XANES (blue) and PDF (red) spectra separately.
  • Figure 4: Regressing grape sugar content. (a) % RMSE for random forests and LCEN models built on Vis-NIR and Raman spectra transformed in different ways. The top three feature and model combinations are marked with green, blue, and red stars, respectively. (b) Top 20 most important features for predicting TSS with the full spectrum (i) and polynomial features extracted from the spectrum (ii) using random forests. (c) 20 highest absolute magnitude coefficients for regressing TSS with the full spectrum (i) and polynomial features extracted from the spectrum (ii) using LCEN. Blue vertical lines: selected/important features. Characteristic polynomials: selected/important features.
  • Figure 5: Features selected by fused LASSO. The top panel shows all of the NIR spectra in the dataset. The bottom panel shows the regression coefficients for fused LASSO models with different regularization parameters $\alpha$.
  • ...and 1 more figures
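
The Figure 5 caption refers to fused LASSO models with a regularization parameter $\alpha$. The sketch below evaluates a textbook-style fused LASSO penalty to show why such models tend to select contiguous spectral regions. It is an illustration under stated assumptions, not the formulation used in the paper: the function name `fused_lasso_penalty`, the `fusion_weight` argument, and the choice to scale both penalty terms by a single `alpha` are hypothetical.

```python
def fused_lasso_penalty(coefs, alpha, fusion_weight=1.0):
    """Evaluate an assumed fused LASSO penalty on a coefficient vector.

    penalty = alpha * (sum |b_j| + fusion_weight * sum |b_j - b_(j-1)|)

    The first term encourages sparsity; the second penalizes jumps between
    coefficients at neighboring spectral positions.
    """
    sparsity = sum(abs(c) for c in coefs)
    jumps = sum(abs(b - a) for a, b in zip(coefs, coefs[1:]))
    return alpha * (sparsity + fusion_weight * jumps)


# Two coefficient vectors with the same number of nonzero entries.
contiguous = [0.0, 1.0, 1.0, 1.0, 0.0, 0.0]  # one block of adjacent features
scattered = [1.0, 0.0, 1.0, 0.0, 1.0, 0.0]   # isolated features

penalty_contiguous = fused_lasso_penalty(contiguous, alpha=0.5)
penalty_scattered = fused_lasso_penalty(scattered, alpha=0.5)
```

Both vectors have the same sparsity term, but the scattered one pays for five jumps while the contiguous one pays for two. Under this assumed penalty, a fit is therefore pushed toward blocks of neighboring coefficients, and increasing `alpha` strengthens that pull.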