Graph Neural Network Prediction of Infrared Spectra of Interstellar Polycyclic Aromatic Hydrocarbons

Guoqing Tang, Jiang He, Zhao Wang, Dong Qiu

TL;DR

The paper addresses the computational bottleneck of generating infrared spectra for diverse interstellar PAHs by training graph neural networks (GNNs) on PAHdb data and comparing architectures (AFP, GCN, GAT, MPNN) against a fixed-feature baseline. It systematically evaluates five spectral-distance losses (EMD, JSD, HD, TVD, SIS) and selects Attentive Fingerprint (AFP) trained with Jensen–Shannon divergence (JSD) as the final model. AFP delivers the best performance among the GNNs, although a circular-fingerprint-based MLP baseline can be highly competitive, and JSD proves the most robust for low-frequency bands. The framework generates spectra 2–5 orders of magnitude faster than density functional theory (DFT), with near-linear scaling in molecular size, which enables rapid approximate spectra for small- to medium-sized PAHs. Extrapolation to large PAHs remains challenging because of limited training data and topology-only representations; future work may integrate physics priors and geometry-aware, equivariant GNNs to improve generalization.

Abstract

Polycyclic aromatic hydrocarbons (PAHs) are recognized as the primary contributors to the aromatic infrared bands (AIBs) widely observed in space. However, analyzing these AIBs remains challenging because of the immense structural diversity within the PAH family, which makes the computation of reliable reference spectra difficult. To address this, we developed an efficient graph neural network (GNN) framework that can predict PAH absorption spectra up to 10,000 times faster than traditional quantum chemical methods. We evaluated four representative GNN architectures, including graph convolutional network (GCN), graph attention network (GAT), message passing neural network (MPNN), and attentive fingerprint (AFP). The AFP model is found to deliver the best overall performance and is further trained using five different spectral distance metrics as loss functions, among which the Jensen-Shannon divergence yields the most accurate and stable results. The model performs best for PAHs containing 20-40 carbon atoms, while accuracy decreases for larger molecules, reflecting the limited availability of training data. Overall, this framework offers a fast method to generate approximate reference spectra for small- to medium-sized PAHs, supporting future AIB analysis.
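To make the idea of a spectral distance metric concrete, the following is a minimal sketch of three of the metrics named above (JSD, TVD, and a one-dimensional EMD), written in plain Python. It is not the authors' implementation. It assumes that both spectra are sampled on the same uniform frequency grid, that intensities are non-negative, and that each spectrum is normalized to unit total intensity before comparison. The function names are hypothetical.

```python
import math


def normalize(spectrum):
    """Scale non-negative intensities to sum to 1 (assumed preprocessing)."""
    total = sum(spectrum)
    if total <= 0:
        raise ValueError("spectrum must have positive total intensity")
    return [v / total for v in spectrum]


def jensen_shannon_divergence(p, q):
    """JSD (natural log) between two spectra on the same grid; range [0, ln 2]."""
    p, q = normalize(p), normalize(q)
    m = [(a + b) / 2.0 for a, b in zip(p, q)]

    def kl(x, y):
        # Terms with x_i == 0 contribute nothing; y_i > 0 wherever x_i > 0.
        return sum(a * math.log(a / b) for a, b in zip(x, y) if a > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def total_variation_distance(p, q):
    """Half the L1 distance between the normalized spectra; range [0, 1]."""
    p, q = normalize(p), normalize(q)
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


def earth_movers_distance(p, q):
    """1D EMD in units of grid bins: sum of absolute CDF differences."""
    p, q = normalize(p), normalize(q)
    cdf_gap, total = 0.0, 0.0
    for a, b in zip(p, q):
        cdf_gap += a - b
        total += abs(cdf_gap)
    return total
```

In this sketch JSD and TVD compare intensities bin by bin, whereas EMD also accounts for how far intensity has to move along the frequency axis, so a slightly shifted band is penalized less than a band that is missing entirely.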

Paper Structure

This paper contains 8 sections, 7 figures, and 3 tables.

Figures (7)

  • Figure 1: Workflow for training an AttentiveFP for IR spectrum computation. SMILES strings from the PAHdb are converted by RDKit into molecular graphs with atom (node) and bond (edge) features. An AttentiveFP encoder performs attention-based message passing over three GNN layers to obtain atom embeddings, followed by an iterative GRU readout that aggregates them into a molecule-level embedding $H_{\mathrm{mol}}$. A linear output layer maps $H_{\mathrm{mol}}$ to the infrared spectrum $\hat{Y}$, and the prediction is trained against the reference spectrum $Y$ using EMD as the primary loss, with JSD, SIS, TVD, and HD available as alternatives.
  • Figure 2: Validation loss (EMD) curves of the five GNN models, in comparison with the MLP model trained with ECFP (dotted line).
  • Figure 3: Distribution of prediction errors for the IR spectra generated by five GNN models and the baseline MLP model, compared against high-level DFT results in the low (left) and high (right) frequency regions.
  • Figure 4: Comparison of predicted (upper) and reference (lower) IR spectra for eight PAHs selected across the JSD error distribution, representing increasing prediction error from best (a) to poorest (h). All spectra are normalized to their maximum intensity.
  • Figure 5: Comparison of GNN-predicted and DFT-computed infrared spectra for three representative pericondensed molecules. The spectra are broadened with a Gaussian line profile (FWHM = 10 cm$^{-1}$) [Yang2020Superhydrogenated].
  • ...and 2 more figures
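
The captions for Figures 4 and 5 mention two post-processing steps: broadening with a Gaussian line profile of a given FWHM, and normalizing each spectrum to its maximum intensity. The sketch below illustrates both steps in plain Python. It is an illustration under stated assumptions, not the authors' code: the function name `broaden`, the stick-spectrum input format, and the example line position are all hypothetical.

```python
import math


def broaden(lines, grid, fwhm=10.0):
    """Broaden a stick spectrum with Gaussians and normalize to the maximum.

    lines: iterable of (wavenumber, intensity) pairs, wavenumbers in cm^-1
    grid:  wavenumbers (cm^-1) at which to evaluate the broadened spectrum
    fwhm:  full width at half maximum of each Gaussian, in cm^-1
    """
    lines = list(lines)
    # Convert FWHM to the Gaussian standard deviation.
    sigma = fwhm / (2.0 * math.sqrt(2.0 * math.log(2.0)))
    spectrum = [
        sum(inten * math.exp(-0.5 * ((x - nu) / sigma) ** 2) for nu, inten in lines)
        for x in grid
    ]
    peak = max(spectrum) if spectrum else 0.0
    # Normalize to the maximum intensity, as in the Figure 4 caption.
    return [v / peak for v in spectrum] if peak > 0 else spectrum


# Hypothetical single line at 1000 cm^-1, evaluated half an FWHM either side.
profile = broaden([(1000.0, 3.0)], [995.0, 1000.0, 1005.0], fwhm=10.0)
```

By the definition of FWHM, the profile at 5 cm$^{-1}$ from the line center falls to half its peak value, which gives a quick sanity check on the conversion from FWHM to standard deviation.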