MultiPUFFIN: A Multimodal Domain-Constrained Foundation Model for Molecular Property Prediction of Small Molecules

Idelfonso B. R. Nogueira, Carine M. Rebello, Mumin Enis Leblebici, Erick Giovani Sperandio Nascimento

TL;DR

Results demonstrate that multimodal encoding and domain-informed biases substantially reduce data and compute requirements compared to brute-force pre-training. MultiPUFFIN also handles missing modalities and recovers meaningful thermodynamic parameters without explicit supervision.

Abstract

Predicting physicochemical properties across chemical space is vital for chemical engineering, drug discovery, and materials science. Current molecular foundation models lack thermodynamic consistency, while domain-informed approaches are limited to single properties and small datasets. We introduce MultiPUFFIN, a domain-constrained multimodal foundation model addressing both limitations simultaneously. MultiPUFFIN features: (i) an encoder fusing SMILES, graphs, and 3D geometries via gated cross-modal attention, alongside experimental condition and descriptor encoders; (ii) prediction heads embedding established correlations (e.g., Wagner, Andrade, van't Hoff, and Shomate equations) as inductive biases to ensure thermodynamic consistency; and (iii) a two-stage multi-task training strategy. Extending prior frameworks, MultiPUFFIN predicts nine thermophysical properties simultaneously. It is trained on a multi-source dataset of 37,968 unique molecules (40,904 rows). With roughly 35 million parameters, MultiPUFFIN achieves a mean $R^2 = 0.716$ on a challenging scaffold-split test set of 8,877 molecules. Compared to ChemBERTa-2 (pre-trained on 77 million molecules), MultiPUFFIN outperforms the fine-tuned baseline across all nine properties despite using 2000x fewer training molecules. Advantages are strikingly apparent for temperature-dependent properties, where ChemBERTa-2 lacks the architectural capacity to incorporate thermodynamic conditions. These results demonstrate that multimodal encoding and domain-informed biases substantially reduce data and compute requirements compared to brute-force pre-training. Furthermore, MultiPUFFIN handles missing modalities and recovers meaningful thermodynamic parameters without explicit supervision. Systematic ablation studies confirm the property-specific benefits of these domain-informed prediction heads.
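
To make the idea of a prediction head that embeds an established correlation more concrete, the following is a minimal sketch built around the Andrade viscosity correlation, $\ln \eta = A + B/T$. It is not the authors' implementation: the function names, the single linear layer that maps an embedding to $(A, B)$, and all numeric values are hypothetical and chosen only for illustration. The point is that the network outputs the equation's parameters rather than the property itself, so the temperature dependence follows the correlation by construction.

```python
import math


def andrade_log_viscosity(a: float, b: float, temperature_k: float) -> float:
    """Andrade correlation: ln(eta) = A + B / T, with T in kelvin."""
    if temperature_k <= 0:
        raise ValueError("temperature must be positive (kelvin)")
    return a + b / temperature_k


def andrade_head(embedding, weights, bias, temperature_k):
    """Hypothetical head: one linear layer maps the molecular embedding
    to the parameters (A, B); the Andrade equation then gives ln(eta)."""
    a, b = (
        sum(w * x for w, x in zip(row, embedding)) + c
        for row, c in zip(weights, bias)
    )
    return andrade_log_viscosity(a, b, temperature_k), (a, b)


# Illustrative values only (assumed, not taken from the paper).
embedding = [0.5, -1.0, 2.0]
weights = [[0.1, 0.2, -0.3], [100.0, -50.0, 400.0]]
bias = [-4.0, 500.0]

ln_eta_300, params = andrade_head(embedding, weights, bias, 300.0)
ln_eta_350, _ = andrade_head(embedding, weights, bias, 350.0)
```

With a positive $B$, the predicted $\ln \eta$ decreases as temperature rises, which matches the behaviour the correlation encodes for liquids. Because $A$ and $B$ are intermediate outputs, they can be inspected after training even though only the property itself is supervised.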

Paper Structure (63 sections, 14 equations, 13 figures, 13 tables)

Figures (13)

  • Figure 1: Architecture overview of MultiPUFFIN. A molecule is simultaneously encoded through three structural modalities: a 2D molecular graph (GCN encoder), a SMILES string (Transformer encoder), and a 3D conformer (SchNet geometry encoder). Two auxiliary encoders process non-structural information: an experimental encoder for thermodynamic conditions (temperature, pressure) and a descriptor encoder for precomputed molecular descriptors. Bidirectional cross-modal attention between the GCN and Transformer branches is followed by gated fusion, with the SchNet embedding incorporated through a learned geometry gate. The auxiliary encoder outputs are concatenated with the fused structural embedding before the final 512-dimensional unified representation feeds nine property-specific prediction heads. Six of these heads embed established thermophysical equations (Wagner, Andrade, van 't Hoff, group contribution, Born, and Shomate) as domain-informed inductive bias neurons, while three properties (log $P$, melting point, and flash point) use direct feedforward architectures. Stars ($\star$) indicate enhanced-capacity heads for vapor pressure and boiling point.
  • Figure 2: RMSE (left) and MAE (right) across all nine properties for the training, validation, and test splits (logarithmic scale). The consistent increase from training to test error reflects the generalization challenge imposed by the scaffold-based splitting strategy. Properties are ordered by decreasing test performance. Note the logarithmic y-axis: temperature-based properties (e.g., melting point, boiling point) have RMSE in the tens of kelvins, while logarithmic-scale properties (e.g., viscosity, vapor pressure) have RMSE below 2.
  • Figure 3: Test set parity plots (predicted vs. experimental) for all nine physicochemical properties. The solid diagonal line represents perfect prediction ($y = x$). Each panel reports the $R^2$, RMSE, MAE, and number of test samples. Properties are ordered by decreasing test $R^2$.
  • Figure 4: Test set residual distributions for all nine properties. Each panel shows the histogram of residuals (predicted $-$ experimental) with a Gaussian fit (solid line) and a zero-residual reference line (dashed red). The mean ($\mu$) and standard deviation ($\sigma$) of the residuals are annotated.
  • Figure 5: Data availability per property and split. The highly heterogeneous sample counts across properties reflect the different availability of experimental measurements in public databases. Training counts reflect unique molecules (before SMILES augmentation).
  • ...and 8 more figures
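
The statistics annotated in the parity and residual plots (Figures 3 and 4) follow standard definitions, sketched below for reference. This is a generic illustration rather than the authors' evaluation code: the function name `regression_metrics` is hypothetical, residuals use the predicted minus experimental convention stated in the Figure 4 caption, and the use of the population (not sample) standard deviation for $\sigma$ is an assumption.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return R^2, RMSE, MAE, and residual mean/std for paired values."""
    n = len(y_true)
    if n == 0 or n != len(y_pred):
        raise ValueError("inputs must be non-empty and of equal length")

    # Residual convention: predicted minus experimental.
    residuals = [p - t for t, p in zip(y_true, y_pred)]

    ss_res = sum(r * r for r in residuals)
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)

    mu = sum(residuals) / n
    # Population standard deviation of the residuals (assumed convention).
    sigma = math.sqrt(sum((r - mu) ** 2 for r in residuals) / n)

    return {
        "r2": 1.0 - ss_res / ss_tot if ss_tot > 0 else float("nan"),
        "rmse": math.sqrt(ss_res / n),
        "mae": sum(abs(r) for r in residuals) / n,
        "residual_mean": mu,
        "residual_std": sigma,
    }


# Toy data for illustration only.
metrics = regression_metrics([1.0, 2.0, 3.0, 4.0], [1.5, 2.0, 2.5, 4.0])
```

A nonzero residual mean indicates systematic over- or under-prediction, which RMSE and MAE alone do not reveal; this is why the residual histograms complement the parity plots.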