Automatic Identification of Compounds in Molecular Mixtures from Liquid-Phase Infrared Spectra

Yannah J. U. Melle, Thanh Nguyen, Jeffrey Lopez, Daniel Schwalbe-Koda

Abstract

Interpreting spectroscopy data is a critical bottleneck in automating chemical research and industrial characterization. Particularly within infrared (IR) spectroscopy, identifying compounds in complex, liquid-phase chemical mixtures largely relies on expert knowledge, as variable peak assignment, broadening, and shifts hinder data-driven methods. Here, we show that an algorithmic approach can identify components in both simulated and experimental mixture spectra with high accuracy despite nonlinearities in liquid-phase IR data. The method is comprehensively benchmarked with a dataset of over 44,000 simulated liquid-phase IR spectra for mixtures and achieves up to 90% accuracy in identifying molecular components across a dataset of binary and ternary liquid mixtures. Our strategy is robust to perturbation of spectra, and its accuracy is capped by near-identical liquid-phase IR spectra that limit the resolution of chemical identification, imposing theoretical limits on achieving perfect accuracy in structure identification. Finally, we apply the method to automatically interpret IR spectra in experimental settings, correctly identifying the components of nearly all samples within a blind study. This work provides tools and data to advance automated chemical laboratories through algorithmic interpretation of liquid-phase IR spectra of mixtures.

Paper Structure (32 sections, 10 equations, 15 figures, 2 tables)

Figures (15)

  • Figure 1: MD-generated gas- and liquid-phase pure and mixture IR spectra and cumulative intensity difference metric analysis. a, Simulated gas and liquid spectra for 3-(dihydroxymethyl)piperidine (molecule A) and 4(5)-ethylimidazole (molecule B). b, Simulated spectrum of a liquid-phase mixture of molecules A and B (bottom) and the equal-weight linear combination (average) of the two molecular spectra (top). The linear sum is not equivalent to the true simulated mixture. c, Raw and cumulative intensities of molecule A's gas- and liquid-phase spectra between 4150--3800 $\mathrm{cm}^{-1}$, illustrating the gas-to-liquid peak shift and broadening. d, Fragment-driven differences between gas and liquid spectra. Distributions and per-core means of fragment-level z-scores for the cumulative distribution function (CDF) between gas- and liquid-phase spectra. Molecules are decomposed into a Murcko-scaffold "core" and their largest remaining fragment. Average CDF values are standardized (z-score) within each core, removing core-specific effects and isolating fragment-dependent contributions. The white dots represent the average of each distribution. e, Mode-specific relative composition of the most common co-fragments for molecules containing a carboxylic acid fragment (O=CO). Molecules are assigned to one of two modes by fitting a two-component Gaussian to their per-core z-score cumulative intensity differences.
  • Figure 2: Identification accuracies of unknown components of two-component liquid-phase mixtures from simulated IR spectra, using MD-generated simulated pure-component IR spectra. a, Workflow to identify unknown mixture components from a liquid-phase mixture spectrum. Given an unknown spectrum and our database of pure-component spectra, an algorithm is used to predict mixture components. b, Prediction accuracies using NNLS, LS, and regularized variants to identify two-component mixtures. Gas- and liquid-phase mixtures were predicted using both gas and liquid pure-spectra basis sets, with NNLS achieving the highest liquid-phase accuracy. c, Prediction accuracies using NNLS for gas and liquid phase mixtures as a function of spectral peak shifts. d, Examples of spectra with increasing peak shift magnitudes (in $\mathrm{cm}^{-1}$).
  • Figure 3: Two-component liquid-phase mixture identification accuracies obtained with NNLS as a function of pure liquid-phase spectra dataset size and prediction criteria. a, Identification accuracies as a function of pure liquid-phase basis set size. Accuracy is reported for identifying all true components from the largest $k=2$--$5$ NNLS coefficients compared with the interpolation baseline, which selects the top two coefficients from a brute-force convex interpolation over all spectrum pairs. For $k=2$, NNLS achieves higher identification accuracy than the interpolation baseline across all dataset sizes. b, Identification accuracies as the prediction criterion increases from $k=2$ to $k=10$, evaluated by (i) requiring all true components to appear within the top $k$ coefficients, (ii) requiring at least one (any) true component to appear within the top $k$ coefficients, and (iii) applying an atom-count filter that restricts candidates to components whose combined atomic compositions match the mixture's atom count (as would be available from mass spectrometry (MS)).
  • Figure 4: Misidentification profiles for predicting all components in two-component liquid-phase mixtures using the top $k=2$ NNLS coefficients. a, True vs. falsely predicted components for characteristic misidentification examples: (i) a predicted component differs by the addition or removal of a carbon relative to the true component; (ii) a predicted component is an isomer of the true component; (iii) a predicted component differs by one-atom substitution; and (iv) misidentification not covered by (i)-(iii). b, Percentage of two-component mixtures that were misidentified when using the top $k=2$ NNLS coefficients to identify both components, aggregated across pure-component dataset sizes. "Mixed" indicates mixtures where multiple misidentification categories (carbon difference, isomer, substitution) apply to the true-false component pair. c, Molecules closest to the true component relative to 3-(dihydroxymethyl)piperidine (molecule A) by spectral distance (MSE) and their corresponding spectra. The spectral similarity among these molecules makes them ambiguous to the NNLS algorithm that minimizes squared error (MSE-equivalent). The variable "n" indicates the number of times the nearest neighbor molecule was incorrectly predicted instead of the true component, across all evaluated mixtures and pure-spectra dataset sizes.
  • Figure 5: Fraction of mixture spectra explained by components ranked by decreasing NNLS coefficient for two- and three-component liquid mixtures. a, Average cumulative and incremental percentage of the explained spectrum across all two- and three-component liquid-phase mixtures, ranked in decreasing NNLS coefficient order, as a function of basis set size with associated errors. b, NNLS coefficients for the top six components and the percentage of the spectrum explained for a three-component liquid mixture. The plateau in the explained-spectra curve indicates the likely number of components in the mixture, as additional component spectra weighted by their coefficients do not further contribute to explaining the mixture.
  • ...and 10 more figures
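
The captions for Figures 2 and 3 describe fitting an unknown mixture spectrum against a database of pure-component spectra with non-negative least squares (NNLS) and reading off the components with the largest $k$ coefficients. The following is a minimal sketch of that idea only, not the authors' implementation. It assumes synthetic Gaussian-peak spectra, a hypothetical 20-entry library, and a mixture built as an exact linear sum plus noise, which, as the Figure 1 caption notes, real liquid-phase mixtures are not. The names `gaussian_spectrum` and `identify_components` are illustrative.

```python
import numpy as np
from scipy.optimize import nnls


def gaussian_spectrum(wavenumbers, centers, widths, heights):
    """Toy spectrum: a sum of Gaussian peaks (synthetic, not data from the paper)."""
    spectrum = np.zeros_like(wavenumbers, dtype=float)
    for center, width, height in zip(centers, widths, heights):
        spectrum += height * np.exp(-0.5 * ((wavenumbers - center) / width) ** 2)
    return spectrum


def identify_components(mixture, basis, k=2):
    """Fit mixture ~ basis @ coeffs with coeffs >= 0; return the top-k column indices.

    basis has shape (n_points, n_candidates): one pure-component spectrum per column.
    """
    coeffs, _residual_norm = nnls(basis, mixture)
    top_k = np.argsort(coeffs)[::-1][:k]  # indices of the k largest coefficients
    return top_k, coeffs


# Hypothetical library: 20 random pure-component spectra on a shared grid.
rng = np.random.default_rng(0)
wavenumbers = np.linspace(500.0, 4000.0, 700)
library = np.column_stack([
    gaussian_spectrum(
        wavenumbers,
        centers=rng.uniform(600.0, 3900.0, 5),
        widths=rng.uniform(10.0, 40.0, 5),
        heights=rng.uniform(0.2, 1.0, 5),
    )
    for _ in range(20)
])

# Assumed idealized mixture: equal-weight linear sum of entries 3 and 11 plus noise.
mixture = 0.5 * library[:, 3] + 0.5 * library[:, 11]
mixture += rng.normal(0.0, 0.005, wavenumbers.size)

top_k, coeffs = identify_components(mixture, library, k=2)
print("Predicted components:", sorted(top_k.tolist()))
```

Because the toy mixture is exactly linear in the library spectra, this sketch recovers the two components easily. The peak shifts and broadening described in the captions are what make the real problem harder.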