SPECTRA: Spectral Target-Aware Graph Augmentation for Imbalanced Molecular Property Regression
Brenda Nogueira, Meng Jiang, Nitesh V. Chawla, Nuno Moniz
TL;DR
SPECTRA addresses imbalanced molecular property regression by operating in the graph spectral domain to generate realistic, topology-preserving augmentations. It aligns molecular graphs with Fused Gromov-Wasserstein (FGW) couplings, interpolates Laplacian spectra and node features in a shared basis, and reconstructs chemically valid intermediate graphs with interpolated targets. A rarity-aware budgeting scheme concentrates augmentation where data are scarce, while a spectral GNN with edge-aware Chebyshev convolutions leverages the augmented data without sacrificing global performance. Across the ESOL, FreeSolv, and Lipophilicity benchmarks, SPECTRA reduces error in underrepresented target ranges, maintains competitive MAE, and produces interpretable synthetic molecules tied to the spectral geometry, demonstrating the practicality of spectral, geometry-aware augmentation for imbalanced molecular regression.
Abstract
In molecular property prediction, the most valuable compounds (e.g., those with high potency) often occupy sparse regions of the target space. Standard Graph Neural Networks (GNNs) are commonly optimized for average error and therefore underperform on these uncommon but critical cases, while existing oversampling methods often distort molecular topology. In this paper, we introduce SPECTRA, a Spectral Target-Aware graph augmentation framework that generates realistic molecular graphs in the spectral domain. SPECTRA (i) reconstructs multi-attribute molecular graphs from SMILES; (ii) aligns molecule pairs via (Fused) Gromov-Wasserstein couplings to obtain node correspondences; (iii) interpolates Laplacian eigenvalues, eigenvectors, and node features in a stable shared basis; and (iv) reconstructs edges to synthesize physically plausible intermediates with interpolated targets. A rarity-aware budgeting scheme, derived from a kernel density estimate of the labels, concentrates augmentation where data are scarce. Coupled with a spectral GNN using edge-aware Chebyshev convolutions, SPECTRA densifies underrepresented regions without degrading global accuracy. On the ESOL, FreeSolv, and Lipophilicity benchmarks, SPECTRA consistently reduces error in the relevant target ranges while maintaining competitive overall MAE, and yields interpretable synthetic molecules whose structure reflects the underlying spectral geometry. Our results demonstrate that spectral, geometry-aware augmentation is an effective and efficient strategy for imbalanced molecular property regression.
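
The rarity-aware budgeting idea can be illustrated with a minimal sketch. The abstract states only that the budget is derived from a kernel density estimate of the labels; everything else here is an assumption made for illustration, not the authors' implementation: the Gaussian kernel, the inverse-density weighting, the largest-remainder rounding, and the names `gaussian_kde`, `rarity_budget`, `total_budget`, and `bandwidth`.

```python
import math


def gaussian_kde(labels, bandwidth):
    """Estimate the label density at each label with a Gaussian kernel."""
    n = len(labels)
    norm = 1.0 / (n * bandwidth * math.sqrt(2.0 * math.pi))
    return [
        norm * sum(math.exp(-0.5 * ((y - z) / bandwidth) ** 2) for z in labels)
        for y in labels
    ]


def rarity_budget(labels, total_budget, bandwidth):
    """Split `total_budget` synthetic samples across seed molecules.

    Assumption: each molecule's share is proportional to the inverse of
    the estimated density at its label, so rare labels get more samples.
    """
    density = gaussian_kde(labels, bandwidth)
    # Density is always positive: each label contributes its own kernel.
    rarity = [1.0 / d for d in density]
    total = sum(rarity)
    raw = [total_budget * r / total for r in rarity]

    # Largest-remainder rounding so the integer counts sum to the budget.
    counts = [math.floor(x) for x in raw]
    leftover = total_budget - sum(counts)
    by_remainder = sorted(
        range(len(labels)), key=lambda i: raw[i] - counts[i], reverse=True
    )
    for i in by_remainder[:leftover]:
        counts[i] += 1
    return counts


if __name__ == "__main__":
    # Toy labels: five values in a dense region and one isolated value.
    toy_labels = [0.0, 0.1, 0.2, 0.1, 0.0, 5.0]
    print(rarity_budget(toy_labels, total_budget=20, bandwidth=0.5))
```

In this toy setting the isolated label receives the largest share of the budget, which is the qualitative behavior the abstract describes. The bandwidth is left as an explicit parameter because the abstract does not say how it is chosen.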
