MolMix: A Simple Yet Effective Baseline for Multimodal Molecular Representation Learning

Andrei Manolache, Dragos Tantaru, Mathias Niepert

TL;DR

A simple transformer-based baseline for multimodal molecular representation learning that integrates three modalities (SMILES strings, 2D graph representations, and 3D conformers of molecules) and proves effective as a strong baseline for the task.

Abstract

In this work, we propose a simple transformer-based baseline for multimodal molecular representation learning, integrating three distinct modalities: SMILES strings, 2D graph representations, and 3D conformers of molecules. A key aspect of our approach is the aggregation of 3D conformers, which lets the model account for the fact that molecules can adopt multiple conformations, an important factor for accurate molecular representation. The tokens for each modality are extracted using modality-specific encoders: a transformer for SMILES strings, a message-passing neural network for 2D graphs, and an equivariant neural network for 3D conformers. The framework is modular, so these encoders can be adapted or replaced easily, which makes the model versatile across molecular tasks. The extracted tokens are then combined into a unified multimodal sequence, which is processed by a downstream transformer for prediction tasks. To scale our model efficiently to large multimodal datasets, we use Flash Attention 2 and bfloat16 precision. Despite its simplicity, our approach achieves state-of-the-art results across multiple datasets, demonstrating its effectiveness as a strong baseline for multimodal molecular representation learning.
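The pipeline described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the three encoders are replaced by tiny hypothetical stand-ins (`encode_smiles`, `encode_graph`, `encode_conformer`), conformers are aggregated by a plain mean (an assumption, since the abstract does not say how aggregation is done), and the downstream transformer is reduced to one parameter-free self-attention step. The token order (3D, SMILES, 2D) follows the ordering stated in the Figure 3 caption.

```python
import numpy as np

D = 8  # hypothetical shared token width
_rng = np.random.default_rng(0)
W_CHAR = _rng.normal(size=(128, D))  # stand-in SMILES embedding table
W_ATOM = _rng.normal(size=(1, D))    # stand-in projection for 2D features
W_DIST = _rng.normal(size=(1, D))    # stand-in projection for 3D features

def encode_smiles(smiles):
    """Stand-in for the SMILES transformer: one token per character."""
    return W_CHAR[[ord(ch) % 128 for ch in smiles]]

def encode_graph(adjacency):
    """Stand-in for the message-passing network: one token per atom,
    computed from the atom's degree."""
    degree = adjacency.sum(axis=1, keepdims=True)  # (n_atoms, 1)
    return degree @ W_ATOM

def encode_conformer(coords):
    """Stand-in for the 3D encoder: one token per atom, computed from its
    mean distance to all atoms (unchanged by rotations and translations)."""
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)  # (n_atoms, n_atoms)
    return dist.mean(axis=1, keepdims=True) @ W_DIST

def aggregate_conformers(conformers):
    """Assumed aggregation: mean over conformers, independent of their order."""
    return np.mean([encode_conformer(c) for c in conformers], axis=0)

def multimodal_sequence(smiles, adjacency, conformers):
    """Concatenate modality tokens in the order 3D, SMILES, 2D."""
    return np.concatenate(
        [aggregate_conformers(conformers),
         encode_smiles(smiles),
         encode_graph(adjacency)],
        axis=0,
    )

def self_attention(x):
    """Single-head softmax self-attention without learned weights."""
    scores = x @ x.T / np.sqrt(x.shape[1])
    scores -= scores.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ x

def predict(smiles, adjacency, conformers):
    """Toy readout: mean over all attended tokens and features."""
    return float(self_attention(multimodal_sequence(smiles, adjacency, conformers)).mean())

# Toy three-heavy-atom molecule with two made-up conformers.
adj = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]])
c1 = np.array([[0.0, 0.0, 0.0], [1.5, 0.0, 0.0], [2.0, 1.2, 0.0]])
c2 = np.array([[0.0, 0.0, 0.0], [1.5, 0.0, 0.0], [2.0, -0.8, 0.9]])
y = predict("CCO", adj, [c1, c2])
```

With these stand-ins, reordering the conformers or rigidly moving one of them leaves the prediction unchanged up to floating-point error, because the mean ignores order and the 3D features depend only on interatomic distances. Swapping in real encoders would change the numbers but not the overall tokenize, concatenate, attend structure.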

Paper Structure

This paper contains 11 sections, 2 theorems, 6 equations, 3 figures, 3 tables.

Key Result

Theorem 1

Let $S$ be the SMILES string, $G$ be the 2D graph, and $\{c_1, \dots, c_k\}$ be a set of $k$ 3D conformers for a molecule. Let $\hat{y} = f_\theta(S, G, \{c_1, \dots, c_k\})$ be the output prediction obtained as described in eq: smiles - eq: mmtf2. Let our 3D encoder be invariant to the actions of s

Figures (3)

  • Figure 1: Modality ablation study on the Kraken dataset (MAE $\downarrow$). We keep the downstream Transformer fixed and train using a single modality or a combination of modalities. Using all three modalities obtains the best results on three out of the four properties, with the second-best results generally being obtained by a configuration that contains 3D conformers. Notably, for the buried Sterimol L property, the best results are obtained by a 3D encoder + Transformer model, indicating that the property could mainly depend on the 3D structure.
  • Figure 2: Transfer learning experiment. We select the best checkpoint of a model trained to predict the electronegativity ($\chi$) on the Drugs-75K dataset. We then freeze the model and train only the last linear readout layer on the Kraken dataset. We compare with a randomly initialized model. For all descriptors, using the pre-trained weights improves predictive performance. Note that pre-training improves the mean performance and also reduces the standard deviation.
  • Figure 3: Red lines mark the boundaries between modalities on the key axis (with keys for each token represented by columns), while yellow lines mark the boundaries on the query axis (with queries represented by rows). The modalities are ordered as 3D, SMILES, and 2D. The attention scores are taken from the first layer of the model and clipped to the $[-10, 10]$ range.
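
The protocol in the Figure 2 caption (freeze the model, train only the last linear readout) can be sketched as follows. This is a hedged illustration rather than the authors' setup: `backbone` is a hypothetical frozen feature extractor with random weights, the data are synthetic, and the readout is fitted in closed form with ridge regression instead of gradient-based training.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical stand-in for pretrained weights; never updated below.
W_frozen = 0.25 * rng.normal(size=(16, 32))

def backbone(x):
    """Frozen feature extractor: fixed weights, no training."""
    return np.tanh(x @ W_frozen)

def with_bias(feats):
    """Append a constant column so the readout can learn an offset."""
    return np.hstack([feats, np.ones((len(feats), 1))])

def fit_readout(x, y, l2=1e-3):
    """Fit only the linear readout on frozen features (ridge regression)."""
    feats = with_bias(backbone(x))
    gram = feats.T @ feats + l2 * np.eye(feats.shape[1])
    return np.linalg.solve(gram, feats.T @ y)

def predict_readout(x, readout):
    """Apply the frozen backbone followed by the fitted readout."""
    return with_bias(backbone(x)) @ readout

# Synthetic target that is linear in the frozen features.
x_train = rng.normal(size=(200, 16))
w_true = rng.normal(size=32)
y_train = backbone(x_train) @ w_true + 0.5

readout = fit_readout(x_train, y_train)
mae = np.abs(predict_readout(x_train, readout) - y_train).mean()
```

Only `readout` is estimated; `W_frozen` stays fixed, which mirrors the frozen-model condition in the caption. The randomly initialized comparison from the caption would correspond to running the same fit on a backbone whose weights were never pre-trained.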

Theorems & Definitions (3)

  • Theorem 1
  • Theorem A.1
  • proof