FragmentNet: Adaptive Graph Fragmentation for Graph-to-Sequence Molecular Representation Learning

Ankur Samanta, Rohan Gupta, Aditi Misra, Christian McIntosh Clarke, Jayakumar Rajadas

TL;DR

FragmentNet introduces an adaptive graph tokenizer and a graph-to-sequence transformer to learn chemically meaningful molecular representations via Masked Fragment Modeling (MFM). The architecture fuses a VQVAE-GCN encoder, graph spatial positional encodings, and a molecular-descriptor CLS token within a Transformer to achieve data-efficient pretraining and strong downstream performance on MoleculeNet and Malaria benchmarks. A fragment-swapping module enables targeted analogue generation, facilitating SAR exploration while maintaining chemical validity. Empirically, FragmentNet outperforms similarly sized baselines and is competitive with larger models trained on far more data, all while running on modest hardware, highlighting a scalable, chemically informed path for molecular design and discovery.

Abstract

Molecular property prediction uses molecular structure to infer chemical properties. Chemically interpretable representations that capture meaningful intramolecular interactions enhance the usability and effectiveness of these predictions. However, existing methods often rely on atom-based or rule-based fragment tokenization, which can be chemically suboptimal and lack scalability. We introduce FragmentNet, a graph-to-sequence foundation model with an adaptive, learned tokenizer that decomposes molecular graphs into chemically valid fragments while preserving structural connectivity. FragmentNet integrates a VQVAE-GCN for hierarchical fragment embeddings, spatial positional encodings for graph serialization, global molecular descriptors, and a transformer. Pre-trained with Masked Fragment Modeling and fine-tuned on MoleculeNet tasks, FragmentNet outperforms models with similarly scaled architectures and datasets while rivaling larger state-of-the-art models requiring significantly more resources. This novel framework enables adaptive decomposition, serialization, and reconstruction of molecular graphs, facilitating fragment-based editing and visualization of property trends in learned embeddings, which makes it a powerful tool for molecular design and optimization.

Paper Structure

This paper contains 61 sections, 3 equations, 13 figures, 9 tables, 2 algorithms.

Figures (13)

  • Figure 1: (Left) A molecule is tokenized into discrete substructures, with one fragment masked. Atom-level attributes are encoded using a VQVAE, and fragment-level attributes are learned using a GCN. The pooled sequence, enriched with spatial positional encodings and a CLS token, is passed through a Transformer. (Right) In MFM, the masked token embedding predicts the masked fragment, while for downstream tasks, the pooled representation encodes the full molecule, which can be reconstructed.
  • Figure 2: t-SNE of the embedding space for the Lipophilicity prediction task before (left) and after (right) fine-tuning, extracted from the second-to-last model layer. Color depicts the true value of each data point.
  • Figure 3: Attention map for 2-methyl-1-butanol (CCC(C)CO) on ESOL (aqueous solubility)
  • Figure 4: Left: Ibuprofen. Center: We fragment off the carboxylic acid and methyl groups. Right: We generate two alternative molecules through fragment swapping, CC(C)Cc1ccc(Cl)cc1, and CC(C)Cc1ccc(cc1)C(=O)O
  • Figure 5: (Top) Distribution of number of atoms in each token in our token dictionary, (Bottom) Number of fragments for each molecule in the pre-training dataset
  • ...and 8 more figures
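
The Figure 1 caption describes Masked Fragment Modeling as tokenizing a molecule into fragments, masking one of them, and predicting the masked fragment from its token embedding. The sketch below illustrates only the masking step on a sequence of integer fragment ids. It is not the authors' implementation: the names `MASK_ID`, `IGNORE`, and `mask_one_fragment`, the example ids, and the uniform choice of position are all assumptions made for illustration. The value -100 is chosen because it matches the default `ignore_index` of PyTorch's cross-entropy loss.

```python
import random

MASK_ID = 0    # hypothetical id reserved for the mask token
IGNORE = -100  # hypothetical label for positions excluded from the loss


def mask_one_fragment(fragment_ids, rng):
    """Mask a single fragment token.

    Returns (inputs, labels, position): `inputs` is a copy of the sequence
    with one id replaced by MASK_ID, and `labels` holds the original id at
    that position and IGNORE everywhere else.
    """
    if not fragment_ids:
        raise ValueError("need at least one fragment token")
    pos = rng.randrange(len(fragment_ids))  # uniform choice (an assumption)
    inputs = list(fragment_ids)             # copy so the caller's list is untouched
    labels = [IGNORE] * len(fragment_ids)
    labels[pos] = inputs[pos]               # prediction target: the original fragment
    inputs[pos] = MASK_ID                   # the model sees the mask token instead
    return inputs, labels, pos


tokens = [17, 4, 92, 4]  # made-up fragment ids for one molecule
inputs, labels, pos = mask_one_fragment(tokens, random.Random(0))
```

In a training loop, `inputs` would be embedded and passed through the Transformer, and the loss would be computed only at `pos`, where `labels` is not `IGNORE`.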