Pretraining Graph Transformers with Atom-in-a-Molecule Quantum Properties for Improved ADMET Modeling

Alessio Fallani, Ramil Nugmanov, Jose Arjona-Medina, Jörg Kurt Wegner, Alexandre Tkatchenko, Kostiantyn Chernichenko

TL;DR

Models pretrained on atomic quantum mechanical properties capture more low-frequency Laplacian eigenmodes of the input graph via their attention weights and produce better representations of atomic environments within the molecule.

Abstract

We evaluate the impact of pretraining Graph Transformer architectures on atom-level quantum-mechanical features for the modeling of absorption, distribution, metabolism, excretion, and toxicity (ADMET) properties of drug-like compounds. We compare this pretraining strategy with two others: one based on molecular quantum properties (specifically the HOMO-LUMO gap) and one using a self-supervised atom masking technique. After fine-tuning on Therapeutic Data Commons (TDC) ADMET datasets, we evaluate the performance improvement of the different models and observe that models pretrained with atomic quantum mechanical properties generally produce better results. We then analyse the latent representations and observe that the supervised strategies preserve the pretraining information after fine-tuning and that different pretraining strategies produce different trends in latent expressivity across layers. Furthermore, we find that models pretrained on atomic quantum mechanical properties capture more low-frequency Laplacian eigenmodes of the input graph via the attention weights and produce better representations of atomic environments within the molecule. Applying the analysis to a much larger non-public dataset for microsomal clearance illustrates the generalizability of the studied indicators. In this case the performance of the models is in accordance with the representation analysis and highlights, especially for masking pretraining and atom-level quantum property pretraining, how model types with similar performance on public benchmarks can perform differently on large-scale pharmaceutical data.

Paper Structure

This paper contains 23 sections, 5 equations, 5 figures, 2 tables.

Figures (5)

  • Figure 1: Comparison, for a molecule in the TDC dataset, between the most relevant eigenvectors of the attention rollout matrix from a model pretrained on atom-level QM properties and the low-frequency eigenvectors of the graph Laplacian associated with the molecular structure. $\ket{a_i}$ are the eigenvectors of the attention rollout matrix $\tilde{A}$ with eigenvalue $a_i$, and $\ket{l_i}$ are the eigenvectors of the graph Laplacian $L$ with eigenvalue $l_i$.
  • Figure 2: $R^2$ for the regression tasks using the representations of a sample of the pretraining data obtained with fine-tuned models. We report the mean and standard deviation over all twenty-two fine-tuning cases.
  • Figure 3: Expressivity of the latent representation measured with the quantity $\rho_L$ as a function of layer number. This quantity is computed for a sample of 2200 structures extracted uniformly from all the fine-tuning test sets (100 structures for each of the 22 tasks) and results are reported as boxplots at each layer. This is done for models pretrained on HLG, models pretrained on all atom-level QM properties, models pretrained with masking and models trained from scratch. The whiskers go from the 15th percentile to the 85th for better visualization of trends and outliers are excluded for the same reason.
  • Figure 4: Spectral perception of the input graphs for the models fine-tuned on the TDC datasets, grouped by pretraining strategy. Results are shown as swarm plots of the values of $\zeta$ averaged over each of the 22 fine-tuning test sets, for each pretraining strategy.
  • Figure 5: Boxplots of the $k^{th}$ neighbour normalized sensitivities $\mathcal{S}_k$ for $k\in [1, \dots,5]$. Each boxplot summarizes a sample of 1100 structures extracted uniformly from all the fine-tuning test sets (50 structures for each of the 22 tasks). We report this quantity for all studied pretraining strategies, and also for the models trained from scratch. The whiskers cover the values from the 15th percentile to the 85th for better visualization of trends. Outliers are excluded for the same reason.
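
The comparison described in the caption of Figure 1 can be illustrated with a small numerical sketch. The code below is not the authors' implementation and does not compute the paper's $\zeta$: it builds the graph Laplacian $L = D - A$ for a toy five-atom chain, forms an attention rollout matrix using the standard rollout recipe (each layer's attention is mixed with the identity, renormalized, and multiplied across layers), and reports the absolute overlaps $|\langle a_i | l_j \rangle|$. The toy attention matrices, the symmetrization step, the choice of `k`, and all function names are assumptions made for illustration.

```python
import numpy as np


def graph_laplacian(adj):
    """Combinatorial Laplacian L = D - A of an undirected graph."""
    return np.diag(adj.sum(axis=1)) - adj


def attention_rollout(attn_layers):
    """Attention rollout over a list of head-averaged attention matrices.

    Each layer is mixed with the identity (to account for the residual
    connection), renormalized so rows sum to one, and multiplied in.
    """
    n = attn_layers[0].shape[0]
    rollout = np.eye(n)
    for attn in attn_layers:
        mixed = 0.5 * attn + 0.5 * np.eye(n)
        mixed = mixed / mixed.sum(axis=1, keepdims=True)
        rollout = mixed @ rollout
    return rollout


def eigenmode_overlaps(rollout, laplacian, k=3):
    """Absolute overlaps between the k largest-magnitude rollout
    eigenvectors and the k lowest-frequency Laplacian eigenvectors.

    Assumption: the rollout matrix is symmetrized first so that its
    eigenvectors are real and orthonormal.
    """
    sym = 0.5 * (rollout + rollout.T)
    a_vals, a_vecs = np.linalg.eigh(sym)
    l_vals, l_vecs = np.linalg.eigh(laplacian)
    top = np.argsort(-np.abs(a_vals))[:k]   # most relevant rollout modes
    low = np.argsort(l_vals)[:k]            # low-frequency Laplacian modes
    return np.abs(a_vecs[:, top].T @ l_vecs[:, low])


# Toy molecular graph: a chain of five heavy atoms (hypothetical example).
n_atoms = 5
adj = np.zeros((n_atoms, n_atoms))
for i in range(n_atoms - 1):
    adj[i, i + 1] = adj[i + 1, i] = 1.0

# Toy attention (assumption): every atom attends uniformly to itself and
# its bonded neighbours, with the same pattern in each of two layers.
local = adj + np.eye(n_atoms)
attn = local / local.sum(axis=1, keepdims=True)

lap = graph_laplacian(adj)
rollout = attention_rollout([attn, attn])
overlaps = eigenmode_overlaps(rollout, lap, k=3)
print(np.round(overlaps, 3))
```

Entries of the printed matrix close to 1 indicate that a dominant rollout eigenvector is nearly aligned with a low-frequency Laplacian eigenvector. With real models, `attn_layers` would hold head-averaged attention matrices extracted from the fine-tuned Graph Transformer rather than the hand-built pattern used here.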