Infrared Spectra Prediction for Diazo Groups Utilizing a Machine Learning Approach with Structural Attention Mechanism
Chengchun Liu, Fanyang Mo
TL;DR
Infrared spectroscopy provides molecular fingerprints, but predicting spectra and interpreting structure–property relationships remain challenging. The authors introduce a Structural Attention Mechanism descriptor (SAMD) that prioritizes chemical information near functional groups, combine it with Morgan fingerprints, and use an ensemble stacking–voting model to predict the IR spectra of diazo compounds. The method achieves high predictive accuracy (cross-validated $R^2$ of about $0.969$) and is robust to variations in molecular similarity and to noise. It also extrapolates to unstable species such as diazomethane, in close agreement with experimental and theoretical benchmarks. SHAP analysis identifies chemically meaningful descriptors, notably the carbonyl count and the atom attached to the diazo group, as key drivers. The work provides a scalable, interpretable framework for spectroscopic prediction in complex molecules and a basis for extending SAM-guided predictions to other functional groups and broader spectroscopic contexts.
Abstract
Infrared (IR) spectroscopy is a pivotal technique in chemical research for elucidating molecular structures and dynamics through vibrational and rotational transitions. However, the intricate molecular fingerprints, characterized by unique vibrational and rotational patterns, present substantial analytical challenges. Here, we present a machine learning approach employing a Structural Attention Mechanism tailored to enhance the prediction and interpretation of infrared spectra, particularly for diazo compounds. Our model distinguishes itself by focusing on chemical information proximal to functional groups, thereby significantly improving the accuracy, robustness, and interpretability of spectral predictions. This method not only clarifies the correlations between infrared spectral features and molecular structures but also offers a scalable and efficient paradigm for dissecting complex molecular interactions.
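To make the idea of weighting chemical information by its proximity to a functional group concrete, the following is a minimal sketch and not the authors' SAMD implementation. It assumes a hypothetical weighting scheme in which each atom contributes `decay ** d` to a per-element count, where `d` is the number of bonds separating it from the nearest atom of the functional group. The function names, the `decay` parameter, and the hand-written heavy-atom graph of ethyl diazoacetate are all illustrative assumptions.

```python
from collections import deque


def graph_distances(adjacency, sources):
    """Breadth-first search: bond-count distance from each atom to the nearest source atom."""
    dist = {atom: 0 for atom in sources}
    queue = deque(sources)
    while queue:
        atom = queue.popleft()
        for neighbor in adjacency[atom]:
            if neighbor not in dist:
                dist[neighbor] = dist[atom] + 1
                queue.append(neighbor)
    return dist


def attention_weighted_counts(elements, adjacency, group_atoms, decay=0.5):
    """Hypothetical proximity-weighted descriptor: sum decay**distance per element."""
    dist = graph_distances(adjacency, group_atoms)
    descriptor = {}
    for atom, element in enumerate(elements):
        if atom in dist:  # skip atoms not connected to the functional group
            weight = decay ** dist[atom]
            descriptor[element] = descriptor.get(element, 0.0) + weight
    return descriptor


# Heavy atoms of ethyl diazoacetate, N2=CH-C(=O)-O-CH2-CH3 (indices are illustrative).
elements = ["N", "N", "C", "C", "O", "O", "C", "C"]
bonds = [(0, 1), (1, 2), (2, 3), (3, 4), (3, 5), (5, 6), (6, 7)]
adjacency = {i: [] for i in range(len(elements))}
for a, b in bonds:
    adjacency[a].append(b)
    adjacency[b].append(a)

# The diazo group is taken to be the two nitrogens plus the carbon they are attached to.
descriptor = attention_weighted_counts(elements, adjacency, group_atoms=[0, 1, 2])
print(descriptor)  # {'N': 2.0, 'C': 1.6875, 'O': 0.5}
```

In this toy example the carbonyl carbon, one bond from the diazo group, contributes 0.5, while the terminal methyl carbon, four bonds away, contributes only 0.0625. Atoms near the functional group therefore dominate the descriptor. In practice such a vector would be concatenated with other features, such as Morgan fingerprints, before being passed to a regression model.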
