Full-Spectrum Machine Learning Diagnostics for Interstellar PAHs
Zhao Wang
TL;DR
Band-ratio diagnostics of interstellar PAHs suffer from information loss and sample-selection bias. The author proposes a full-spectrum infrared morphology approach that treats the $2.75$-$20\,\mu$m emission as a high-dimensional fingerprint and trains a Random Forest on $23{,}653$ spectra to infer PAH size and charge. On synthetic mixtures of unseen molecules, the reported macro F1-score is about $0.96$ across $12$ size/charge classes, and interpretable feature importances reveal distinct wavelength fingerprints for size and charge. Size signatures for neutral PAHs appear in the $3.21$-$3.29\,\mu$m and $11$-$14\,\mu$m bands, charge discrimination relies on the $6.25$ and $7.81\,\mu$m C-C/C-H modes, and the $12.5\,\mu$m feature acts as a cross-charge tracer. The method is also reported to be robust to excitation conditions, supporting its use for ISM diagnostics.
Abstract
Traditional interstellar polycyclic aromatic hydrocarbon (PAH) diagnostics rely on empirical band ratios, which often suffer from information loss and sample-selection bias. We introduce a machine learning framework that bypasses these limitations by treating the complete 2.75-20 micron emission spectrum as a high-dimensional fingerprint. Using a Random Forest classifier trained on a dataset of 23,653 spectra, we achieve a classification F1-score of about 0.96 across 12 size and charge categories. Our model maintains high performance on synthetic mixtures of unseen molecules. Feature importance analysis reveals that PAH size diagnostics are not universal but highly charge-dependent: while neutral size is traced mainly by C-H stretching modes, sizing ionized species also relies on the morphology of the 6-8 micron C-C complexes, with the 12.5 micron feature emerging as a robust cross-charge tracer. This approach provides a data-driven pathway for decoding the physical conditions of the interstellar medium.
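The macro F1-score quoted above is the unweighted mean of the per-class F1 scores, so a rare size/charge class counts as much as a common one. The sketch below shows how such a score could be computed from true and predicted class labels. It is a minimal illustration only: the function name `macro_f1` and the four class labels are hypothetical, the toy predictions are not taken from the paper, and the paper's own evaluation code and its 12 class definitions are not reproduced here.

```python
from collections import Counter


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all classes seen in either list."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1   # correct prediction for this class
        else:
            fp[pred] += 1    # predicted class gains a false positive
            fn[truth] += 1   # true class gains a false negative
    scores = []
    for label in labels:
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the denominator is 0.
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical size/charge labels, for illustration only.
y_true = ["small_neutral", "small_neutral", "small_cation",
          "large_neutral", "large_neutral", "large_cation"]
y_pred = ["small_neutral", "small_cation", "small_cation",
          "large_neutral", "large_neutral", "large_cation"]

score = macro_f1(y_true, y_pred)
print(f"macro F1 = {score:.3f}")
```

Because every class contributes equally to the average, a single confusion between two sparsely populated classes lowers the score noticeably, which is why macro averaging is a common choice when class sizes are uneven.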
