SPECTRA: Spectral Target-Aware Graph Augmentation for Imbalanced Molecular Property Regression

Brenda Nogueira, Meng Jiang, Nitesh V. Chawla, Nuno Moniz

TL;DR

SPECTRA addresses imbalanced molecular property regression by operating in the graph spectral domain to generate realistic, topology-preserving augmentations. It aligns molecular graphs with Fused Gromov-Wasserstein (FGW) couplings, interpolates Laplacian spectra and node features in a shared basis, and reconstructs chemically valid intermediate graphs with interpolated targets. A rarity-aware budgeting scheme concentrates augmentation where data are scarce, while a spectral GNN with edge-aware Chebyshev convolutions leverages the augmented data without sacrificing global performance. Across the ESOL, FreeSolv, and Lipophilicity benchmarks, SPECTRA improves recovery in underrepresented target ranges, maintains competitive MAE, and produces interpretable synthetic molecules tied to the spectral geometry. These results demonstrate the practicality of spectral, geometry-aware augmentation for imbalanced molecular regression.

Abstract

In molecular property prediction, the most valuable compounds (e.g., those with high potency) often occupy sparse regions of the target space. Standard Graph Neural Networks (GNNs) typically optimize for average error and therefore underperform on these uncommon but critical cases, while existing oversampling methods often distort molecular topology. In this paper, we introduce SPECTRA, a Spectral Target-Aware graph augmentation framework that generates realistic molecular graphs in the spectral domain. SPECTRA (i) reconstructs multi-attribute molecular graphs from SMILES; (ii) aligns molecule pairs via (Fused) Gromov-Wasserstein couplings to obtain node correspondences; (iii) interpolates Laplacian eigenvalues, eigenvectors, and node features in a stable shared basis; and (iv) reconstructs edges to synthesize physically plausible intermediates with interpolated targets. A rarity-aware budgeting scheme, derived from a kernel density estimate of the labels, concentrates augmentation where data are scarce. Coupled with a spectral GNN using edge-aware Chebyshev convolutions, SPECTRA densifies underrepresented regions without degrading global accuracy. On benchmarks, SPECTRA consistently reduces error in the relevant target ranges while maintaining competitive overall MAE, and yields interpretable synthetic molecules whose structure reflects the underlying spectral geometry. Our results demonstrate that spectral, geometry-aware augmentation is an effective and efficient strategy for imbalanced molecular property regression.
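To make the rarity-aware budgeting idea concrete, the following is a minimal sketch and not the paper's implementation. It assumes a 1-D Gaussian kernel density estimate over the training labels with Scott's rule-of-thumb bandwidth, and it assumes that the augmentation budget is split in proportion to inverse estimated density. The function names (`scott_bandwidth`, `gaussian_kde`, `rarity_budget`), the inverse-density weighting, and the toy labels are all hypothetical.

```python
import math
from statistics import stdev


def scott_bandwidth(labels):
    """Scott's rule of thumb for 1-D data: h = sigma * n^(-1/5)."""
    n = len(labels)
    return stdev(labels) * n ** (-1.0 / 5.0)


def gaussian_kde(labels, h):
    """Return a function that estimates the label density at a point y."""
    n = len(labels)
    norm = 1.0 / (n * h * math.sqrt(2.0 * math.pi))

    def density(y):
        return norm * sum(math.exp(-0.5 * ((y - yi) / h) ** 2) for yi in labels)

    return density


def rarity_budget(labels, total_budget):
    """Split total_budget synthetic samples across the training points.

    Assumption: each point's share is proportional to the inverse of the
    estimated label density at that point, so rare labels receive more.
    Rounding means the shares may not sum exactly to total_budget.
    """
    density = gaussian_kde(labels, scott_bandwidth(labels))
    rarity = [1.0 / density(y) for y in labels]
    scale = total_budget / sum(rarity)
    return [round(r * scale) for r in rarity]


# Toy labels: six values clustered near -3 and one isolated value at 1.5.
labels = [-3.1, -3.0, -2.9, -2.8, -3.2, -3.05, 1.5]
budget = rarity_budget(labels, total_budget=20)
```

In this toy setting the isolated label receives the largest share of the budget, which is the qualitative behavior the abstract describes: augmentation is concentrated where labels are scarce.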

Paper Structure

This paper contains 27 sections, 8 equations, 6 figures, 6 tables.

Figures (6)

  • Figure 1: Distribution of target property values across three molecular datasets (ESOL, FreeSolv, and Lipo). Each subplot shows a normalized histogram of the experimental values with a Gaussian kernel density estimate (KDE) overlaid using Scott’s rule-of-thumb bandwidth. These plots highlight the skewness and spread of target distributions, which can influence model training and performance.
  • Figure 1: Validity, Uniqueness, and Novelty of generated molecules across datasets.
  • Figure 2: Pipeline of spectral molecular interpolation. Molecular graphs are first aligned via Gromov–Wasserstein matching, after which their three edge-specific Laplacians are decomposed and interpolated in the spectral domain, while node features are projected into the aligned eigenbasis and combined in the same way. Target values are interpolated alongside these representations, producing coherent intermediate graphs that preserve topology while smoothly blending molecular properties and labels to enrich underrepresented regions of the distribution.
  • Figure 3: Joint distribution plots of molecular properties versus task targets for original (blue, circles, solid marginals) and augmented (orange, crosses, dashed marginals) molecules. Each row corresponds to a dataset (FreeSolv, ESOL, Lipo), and each column shows one computed property (LogP, SA, QED, MW, BT).
  • Figure 4: Mean Absolute Error (MAE) distribution across target value ranges for each dataset. Colors correspond to different models as indicated in the legend.
  • ...and 1 more figure
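The per-range error analysis described in the Figure 4 caption can be illustrated with a short sketch. This is an assumed, simplified version of such an evaluation: the bin edges, the helper name `mae_by_target_range`, and the toy predictions are hypothetical and are not taken from the paper.

```python
def mae_by_target_range(y_true, y_pred, edges):
    """Mean absolute error per target bin.

    Bins are [edges[i], edges[i + 1]); the last bin also includes its
    right edge. Bins with no samples are reported as None.
    """
    n_bins = len(edges) - 1
    sums = [0.0] * n_bins
    counts = [0] * n_bins
    for yt, yp in zip(y_true, y_pred):
        for i in range(n_bins):
            is_last = i == n_bins - 1
            if edges[i] <= yt < edges[i + 1] or (is_last and yt == edges[-1]):
                sums[i] += abs(yt - yp)
                counts[i] += 1
                break
    return [s / c if c else None for s, c in zip(sums, counts)]


# Toy example: the overall MAE is 0.5, but the error differs across ranges.
y_true = [-4.0, -3.5, -1.0, -0.5, 0.5, 2.0]
y_pred = [-3.0, -3.0, -1.0, -1.0, 0.5, 1.0]
per_bin = mae_by_target_range(y_true, y_pred, edges=[-4.0, -2.0, 0.0, 2.0])
# per_bin == [0.75, 0.25, 0.5]
```

Reporting error per range rather than a single average makes it visible when a model performs well overall but poorly in sparse regions of the target space.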