TrustMol: Trustworthy Inverse Molecular Design via Alignment with Molecular Dynamics

Kevin Tirta Wijaya, Navid Ansari, Hans-Peter Seidel, Vahid Babaei

TL;DR

TrustMol addresses the trustworthiness gap in inverse molecular design (IMD) by aligning a neural surrogate with the native forward process (NFP), the ground-truth function that models the molecular dynamics. It introduces SGP-VAE, a variational autoencoder whose latent space jointly encodes SELFIES strings, 3D geometry, and properties, and it mitigates VAE reconstruction issues with latent-property data augmentation. The method estimates epistemic uncertainty with an ensemble of surrogates and uses it to guide latent optimization, which improves alignment with the NFP and reduces the surrogate-NFP error gap. In NFP-based evaluation with Psi4, TrustMol achieves higher accuracy, reliability, and efficiency than diffusion-based and direct surrogate-based baselines in both single- and multi-objective IMD.

Abstract

Data-driven generation of molecules with desired properties, also known as inverse molecular design (IMD), has attracted significant attention in recent years. Despite significant progress in the accuracy and diversity of solutions, existing IMD methods lag behind in terms of trustworthiness. The root issue is that the design process of these methods is increasingly implicit and indirect, and that it is isolated from the native forward process (NFP), the ground-truth function that models the molecular dynamics. Following this insight, we propose TrustMol, an IMD method built to be trustworthy. For this purpose, TrustMol relies on a set of technical novelties, including a new variational autoencoder network. Moreover, we propose a latent-property pairs acquisition method to effectively navigate the complexities of molecular latent optimization, a process that seems intuitive yet is challenging due to the high-frequency and discontinuous nature of molecule space. TrustMol also integrates uncertainty-awareness into molecular latent optimization. Together, these lead to improvements in both the explainability and the reliability of the IMD process. We validate the trustworthiness of TrustMol through a wide range of experiments.
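
To make the idea of uncertainty-aware latent optimization concrete, here is a minimal sketch in plain Python. It is not the authors' implementation. It assumes an ensemble of toy linear surrogates standing in for the neural surrogates, uses the variance across ensemble members as a proxy for epistemic uncertainty, and minimizes an assumed loss of the form (mean prediction - target)^2 + lam * variance by gradient descent on the latent vector. All names (`make_toy_ensemble`, `loss_and_grad`, `optimize_latent`, `lam`) are hypothetical and chosen for illustration only.

```python
import random


def predict(member, z):
    """Toy linear surrogate: w . z + b (stand-in for a neural surrogate)."""
    w, b = member
    return sum(wi * zi for wi, zi in zip(w, z)) + b


def ensemble_stats(ensemble, z):
    """Return mean prediction, variance across members, and raw predictions."""
    preds = [predict(m, z) for m in ensemble]
    mean = sum(preds) / len(preds)
    var = sum((p - mean) ** 2 for p in preds) / len(preds)
    return mean, var, preds


def loss_and_grad(ensemble, z, target, lam):
    """Assumed loss: (mean - target)^2 + lam * variance, with its analytic gradient in z."""
    mean, var, preds = ensemble_stats(ensemble, z)
    k, dim = len(ensemble), len(z)
    # Average weight vector of the ensemble.
    w_bar = [sum(m[0][i] for m in ensemble) / k for i in range(dim)]
    grad = []
    for i in range(dim):
        g_fit = 2.0 * (mean - target) * w_bar[i]
        g_unc = (2.0 / k) * sum(
            (p - mean) * (m[0][i] - w_bar[i]) for p, m in zip(preds, ensemble)
        )
        grad.append(g_fit + lam * g_unc)
    return (mean - target) ** 2 + lam * var, grad


def optimize_latent(ensemble, target, dim, lam=1.0, lr=0.05, steps=500, seed=0):
    """Gradient descent on z from a random start; returns final z and loss history."""
    rng = random.Random(seed)
    z = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    history = []
    for _ in range(steps):
        loss, grad = loss_and_grad(ensemble, z, target, lam)
        history.append(loss)
        z = [zi - lr * gi for zi, gi in zip(z, grad)]
    return z, history


def make_toy_ensemble(k=5, dim=4, seed=1):
    """Hypothetical ensemble: k noisy copies of one random linear model."""
    rng = random.Random(seed)
    base = [rng.uniform(-1.0, 1.0) for _ in range(dim)]
    return [
        ([w + rng.gauss(0.0, 0.1) for w in base], rng.gauss(0.0, 0.05))
        for _ in range(k)
    ]


if __name__ == "__main__":
    ensemble = make_toy_ensemble()
    z_opt, history = optimize_latent(ensemble, target=0.5, dim=4)
    mean, var, _ = ensemble_stats(ensemble, z_opt)
    print(f"initial loss {history[0]:.4f}, final loss {history[-1]:.4f}")
    print(f"mean prediction {mean:.4f}, ensemble variance {var:.6f}")
```

In this sketch, raising `lam` steers the optimized latent toward regions where the ensemble members agree, at some cost in how closely the mean prediction matches the target. In the pipeline the abstract describes, the optimized latent would then be decoded into a molecule and checked against the NFP rather than against the surrogate.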

Paper Structure (26 sections, 9 equations, 6 figures, 6 tables)

This paper contains 26 sections, 9 equations, 6 figures, 6 tables.

Figures (6)

  • Figure 1: The overall pipeline of TrustMol. (A) shows the training of SGP-VAE, in which $\Phi_{\text{dec}}$ is trained to reconstruct ${\bm{x}}_{\text{graph}}$ and ${\bm{x}}_{\text{selfies}}$, and to predict the properties ${\bm{q}}$ from the latent vector ${\bm{z}}$. (B) is the latent-property pairs acquisition method, which creates a new dataset for training the surrogate model. (C) shows the latent optimization process that generates ${\bm{z}}$ from ${\bm{q}}$. (D) is the post-processing and evaluation, in which the optimal latent ${\bm{z}}^*$ is decoded back into a molecular string ${\bm{x}}^*_{\text{selfies}}$ and evaluated with the NFP.
  • Figure 2: The architecture for the latent-to-property subnetwork. A (x, y) block represents an nn.Linear layer with an input dimensionality of x and an output dimensionality of y.
  • Figure 3: Additional regularizations can be easily incorporated into TrustMol. Here, we add molecular mass to the optimization objectives, penalizing molecular designs with high masses. We can see that the distribution of the generated molecular designs shifts toward molecules with lower molecular mass.
  • Figure 4: Visualization of the hypervolume of MAE for LIMO and TrustMol. The hypervolume of TrustMol covers a clearly smaller space than that of LIMO.
  • Figure 5: Plot of epistemic uncertainty values predicted by a surrogate model and the distribution of HOMO values in the training dataset.
  • ...and 1 more figure