Solvaformer: an SE(3)-equivariant graph transformer for small molecule solubility prediction

Jonathan Broadbent, Michael Bailey, Mingxuan Li, Abhishek Paul, Louis De Lescure, Paul Chauvin, Lorenzo Kogler-Anele, Yasser Jangjou, Sven Jager

TL;DR

Solvation prediction for small molecules is challenged by the tension between physics-based accuracy and scalable data-driven models. Solvaformer unifies SE(3)-equivariant intra-molecular processing with inter-molecular scalar attention across independent molecular copies, trained with alternating batches on CombiSolv-QM and BigSolDB 2.0 for solvation energy and log S. The approach delivers strong performance among learned models, closely approaching a DFT-assisted baseline, while providing interpretable attributions that reveal physically plausible intra- vs. inter-molecular hydrogen-bonding effects. The method advances practical solubility prediction by combining geometric bias with mixed data sources, enabling efficient screening and hypothesis generation in solution-phase chemistry. Overall, Solvaformer offers a scalable, interpretable framework that leverages multi-task learning and geometry-aware transformers to improve solubility and solvation energy predictions in real-world settings.

Abstract

Accurate prediction of small molecule solubility using material-sparing approaches is critical for accelerating synthesis and process optimization, yet experimental measurement is costly and many learning approaches either depend on quantum-derived descriptors or offer limited interpretability. We introduce Solvaformer, a geometry-aware graph transformer that models solutions as multiple molecules with independent SE(3) symmetries. The architecture combines intramolecular SE(3)-equivariant attention with intermolecular scalar attention, enabling cross-molecular communication without imposing spurious relative geometry. We train Solvaformer in a multi-task setting to predict both solubility (log S) and solvation free energy, using an alternating-batch regimen that trains on quantum-mechanical data (CombiSolv-QM) and on experimental measurements (BigSolDB 2.0). Solvaformer attains the strongest overall performance among the learned models and approaches a DFT-assisted gradient-boosting baseline, while outperforming an EquiformerV2 ablation and sequence-based alternatives. In addition, token-level attention produces chemically coherent attributions: case studies recover known intra- vs. inter-molecular hydrogen-bonding patterns that govern solubility differences in positional isomers. Taken together, Solvaformer provides an accurate, scalable, and interpretable approach to solution-phase property prediction by uniting geometric inductive bias with a mixed dataset training strategy on complementary computational and experimental data.
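
The alternating-batch regimen mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not state the alternation ratio or how the two datasets' different sizes are handled, so the strict 1:1 switching and the cycling of each source below are assumptions, and the function name `alternating_batches` and the task labels are hypothetical.

```python
from itertools import cycle


def alternating_batches(qm_batches, exp_batches, steps):
    """Yield (task, batch) pairs, switching data source on every step.

    Assumptions (not from the paper): strict 1:1 alternation, and each
    source restarts from its first batch once exhausted.
    """
    sources = {
        "solvation_energy": cycle(qm_batches),  # e.g. CombiSolv-QM batches
        "log_s": cycle(exp_batches),            # e.g. BigSolDB 2.0 batches
    }
    order = cycle(sources)  # cycles over the task names in insertion order
    for _ in range(steps):
        task = next(order)
        yield task, next(sources[task])


# Placeholder strings stand in for real mini-batches.
schedule = list(
    alternating_batches(["qm0", "qm1"], ["exp0", "exp1", "exp2"], steps=6)
)
for task, batch in schedule:
    print(task, batch)
```

In a training loop, the task label would select which prediction head and loss are used for that step, so both targets share one backbone while drawing on different data sources.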

Paper Structure

This paper contains 28 sections, 3 equations, 6 figures, 2 tables.

Figures (6)

  • Figure 1: Example of Solvaformer updating the hidden representation of a nitrogen atom ($H_i$) in a single layer for the inputs water and acetamide. Irreps features encode relative 3D positions via spherical harmonics and are collected only from intramolecular atoms. Because only scalar features are received from the solvent atoms, the hidden representation is constructed without assuming any SE(3)-equivariant relationship between atoms of different molecules.
  • Figure 2: Solute-to-solvent attention maps demonstrating Solvaformer's chemical intuition. (b) For the para isomer, the hydroxyl proton (H15) shows clear attention to water, indicating intermolecular H-bonding. (d) For the ortho isomer, this attention from H15 disappears, correctly implying it is occupied in a dominant intramolecular H-bond.
  • Figure 3: In BigSolDB 2.0, 6591 measurements were duplicated across separate sources (same solute, solvent, and temperature, but measured in a different laboratory with a different reported solubility). We removed these measurements from the dataset. Here we compute the standard deviation within each group of duplicated measurements and plot the distribution. This provides an estimate of the precision of experimental solubility measurements and hence a lower bound on the prediction error achievable on our dataset.
  • Figure 4: Hyperparameter tuning of Solvaformer. We performed 23 runs in total. WandB agents selected the hyperparameters of successive runs using Bayesian optimization, with performance measured by MSE on the BigSolDB 2.0 validation set.
  • Figure 5: Distribution of measured logS in the train-test split.
  • ...and 1 more figure
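
The duplicate analysis described in the Figure 3 caption can be sketched as follows. This is an illustration under stated assumptions rather than the authors' code: the record layout, the exact-match grouping on temperature, and the use of the sample (n-1) standard deviation are all assumptions, the function name `duplicate_group_stdevs` is hypothetical, and the values are toy numbers, not data from BigSolDB 2.0.

```python
from collections import defaultdict
from statistics import stdev


def duplicate_group_stdevs(records):
    """Return the sample standard deviation of log S for every
    (solute, solvent, temperature) group with more than one measurement."""
    groups = defaultdict(list)
    for solute, solvent, temperature_k, log_s in records:
        groups[(solute, solvent, temperature_k)].append(log_s)
    # Groups with a single measurement carry no spread information.
    return {key: stdev(vals) for key, vals in groups.items() if len(vals) > 1}


# Toy records: (solute, solvent, temperature in K, log S).
records = [
    ("A", "water", 298.15, -1.0),
    ("A", "water", 298.15, -1.4),    # same conditions, different source
    ("A", "ethanol", 298.15, -0.2),  # unique measurement, not a duplicate
    ("B", "water", 298.15, -3.0),
    ("B", "water", 298.15, -3.0),
]
spread = duplicate_group_stdevs(records)
for key, value in spread.items():
    print(key, round(value, 4))
```

The distribution of these per-group values is what gives a rough noise floor: a model evaluated on such data cannot be expected to achieve an error much below the typical disagreement between laboratories.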