PRESTO: Progressive Pretraining Enhances Synthetic Chemistry Outcomes

He Cao, Yanjun Shao, Zhiyuan Liu, Zijing Liu, Xiangru Tang, Yuan Yao, Yu Li

TL;DR

The paper tackles the challenge of integrating molecule-text information for synthetic chemistry by proposing PRESTO, a progressive pretraining framework that enables cross-modal alignment and interleaved multi-graph reasoning. It introduces a two-stage pretraining strategy (alignment followed by domain incremental pretraining) plus supervised fine-tuning, backed by a curated dataset mix including a PubChem caption set, a large interleaved USPTO-PubChem corpus, and a name-conversion task set, totaling $\sim 3\times 10^6$ samples. The authors demonstrate PRESTO’s competitive downstream performance across reaction prediction, reaction-condition prediction, reagent selection, reaction type classification, and yield regression, highlighting the importance of molecular representation granularity and data configuration. They also discuss limitations and future directions toward richer molecular representations (2D/3D) and expanded domain data to further bridge the chemistry-text modality gap and enable broader, safer practical deployment.

Abstract

Multimodal Large Language Models (MLLMs) have seen growing adoption across various scientific disciplines. These advancements encourage the investigation of molecule-text modeling within synthetic chemistry, a field dedicated to designing and conducting chemical reactions to synthesize new compounds with desired properties and applications. Current approaches, however, often neglect the critical role of interactions among multiple molecule graphs in understanding chemical reactions, leading to suboptimal performance in synthetic chemistry tasks. This study introduces PRESTO (Progressive Pretraining Enhances Synthetic Chemistry Outcomes), a new framework that bridges the molecule-text modality gap by integrating a comprehensive benchmark of pretraining strategies and dataset configurations. It progressively improves multimodal LLMs through cross-modal alignment and multi-graph understanding. Our extensive experiments demonstrate that PRESTO offers competitive results in downstream synthetic chemistry tasks. The code can be found at https://github.com/IDEA-XL/PRESTO.

Paper Structure (65 sections, 1 equation, 12 figures, 22 tables)

Figures (12)

  • Figure 1: Panel (top left) illustrates the components of a prototypical chemical reaction. Panel (bottom left) shows the synthetic chemistry tasks that PRESTO can support as a dialogue assistant. Panel (right) provides an overview of the two primary stages in our Progressive Pretraining Strategy PRESTO: the Molecule-Text Alignment stage and the Domain Incremental Pretraining stage. These stages enable the evolution from single-graph text modeling to complex interleaved multi-graph text modeling.
  • Figure 2: Panel (a) illustrates the interleaved molecule-text dataset format, primarily derived from USPTO-Application. Panel (b) displays the five tasks included in the Molecular Name Conversion Tasks (directions drawn as arrows), with data mainly sourced from PubChem, IUPAC, and ChEMBL.
  • Figure 3: Performance analysis of different pretraining strategies and dataset configurations. (a) Ablation study on the multi-modal pretraining strategy. (b) We explore various options for the granularity of molecular encoded tokens. (c) Comparison between base (Llama-2) and instruct-tuned (Vicuna v1.5) language models. (d) Ablation study on dataset configuration for PRESTO domain incremental pretraining stage.
  • Figure 4: Comparison of similarity distributions for reaction prediction datasets. The plots show the count of scaffolds within each similarity range for the full test datasets provided in LLaSMol and Mol-Instruction (raw data, lighter shade) and the selected subsets of 1000 scaffolds with the lowest similarities (darker shade).
  • Figure 5: Statistics of the Interleaved Molecule-Text Dataset.
  • ...and 7 more figures
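
The subset selection mentioned in the Figure 4 caption (keeping the test scaffolds with the lowest similarities) can be illustrated with a minimal sketch. The caption does not state the similarity metric or what each scaffold is compared against, so everything here is an assumption: scaffolds are represented as hypothetical sets of integer features, similarity is Tanimoto, and each test scaffold is scored by its maximum similarity to any training scaffold. The function names are illustrative and are not taken from the PRESTO codebase.

```python
import heapq


def tanimoto(a: frozenset, b: frozenset) -> float:
    """Tanimoto (Jaccard) similarity between two feature sets."""
    if not a and not b:
        return 1.0  # two empty sets are treated as identical
    inter = len(a & b)
    return inter / (len(a) + len(b) - inter)


def max_similarity_to_train(test_fp: frozenset, train_fps: list) -> float:
    """Highest similarity between one test item and any training item."""
    return max((tanimoto(test_fp, t) for t in train_fps), default=0.0)


def select_least_similar(test_fps: dict, train_fps: list, k: int) -> list:
    """Return the names of the k test items least similar to the training set.

    Assumption: "lowest similarity" means lowest maximum Tanimoto
    similarity to the training set; the source does not specify this.
    """
    scored = [
        (max_similarity_to_train(fp, train_fps), name)
        for name, fp in test_fps.items()
    ]
    # nsmallest sorts by score first, then by name to break ties.
    return [name for _, name in heapq.nsmallest(k, scored)]


# Toy data: integer "bits" stand in for real scaffold fingerprints.
train = [frozenset({1, 2, 3}), frozenset({3, 4})]
test = {
    "A": frozenset({1, 2, 3}),  # identical to a training item
    "B": frozenset({7, 8}),     # shares nothing with the training set
    "C": frozenset({3, 4, 5}),  # partial overlap
}
hardest = select_least_similar(test, train, k=2)
```

With real molecules, the feature sets would typically come from a cheminformatics toolkit, and `k` would be 1000 to match the subset size in the caption.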