Specialising and Analysing Instruction-Tuned and Byte-Level Language Models for Organic Reaction Prediction

Jiayun Pang, Ivan Vulić

TL;DR

The paper addresses how to leverage language-pretrained encoder-decoder models for organic reaction prediction without GPU-intensive molecule pretraining. It conducts a systematic empirical study of FlanT5 and ByT5 variants across tokenisation, SMILES-related pretraining, data-efficient fine-tuning, and decoding strategies on multiple USPTO-based tasks. Key findings show that language-only pretraining yields competitive performance, with byte-level ByT5 often offering robustness, and that vocabulary trimming plus greedy decoding can maintain accuracy while boosting efficiency; continued SMILES pretraining offers mixed or negative gains. The results suggest cheaper, more accessible pathways for applying state-of-the-art language models to chemistry tasks, guiding future research on data-efficient, multi-task, and modular fine-tuning in cheminformatics.

Abstract

Transformer-based encoder-decoder models have demonstrated impressive results in chemical reaction prediction tasks. However, these models typically rely on pretraining using tens of millions of unlabelled molecules, which can be time-consuming and GPU-intensive. One of the central questions we aim to answer in this work is: Can FlanT5 and ByT5, encoder-decoder models pretrained solely on language data, be effectively specialised for organic reaction prediction through task-specific fine-tuning? We conduct a systematic empirical study of several key aspects of the process, including tokenisation, the impact of (SMILES-oriented) pretraining, fine-tuning sample efficiency, and decoding algorithms at inference. Our key findings indicate that, despite being pretrained only on language tasks, FlanT5 and ByT5 provide a solid foundation for fine-tuning on reaction prediction, and thus become 'chemistry domain compatible' in the process. This suggests that GPU-intensive and expensive pretraining on a large dataset of unlabelled molecules may be useful but is not essential for leveraging the power of language models for chemistry. All our models achieve comparable Top-1 and Top-5 accuracy, although some variation across models does exist. Notably, tokenisation and vocabulary trimming slightly affect final performance but can speed up training and inference; the most efficient decoding strategy, greedy decoding, is very competitive, and more sophisticated decoding algorithms yield only marginal gains. In summary, we evaluate FlanT5 and ByT5 across several dimensions and benchmark their impact on organic reaction prediction, which may guide more effective use of these state-of-the-art language models for chemistry-related tasks in the future.
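
The Top-1 and Top-5 accuracy mentioned in the abstract can be illustrated with a minimal sketch. It assumes that each reaction has a ranked list of candidate product SMILES and that a prediction counts as correct only on an exact string match with the reference; the function name `top_k_accuracy` and the toy SMILES strings are hypothetical, and the paper's actual evaluation (for example, any SMILES canonicalisation before comparison) may differ.

```python
from typing import Sequence


def top_k_accuracy(
    predictions: Sequence[Sequence[str]],
    targets: Sequence[str],
    k: int,
) -> float:
    """Fraction of examples whose target appears among the first k candidates.

    Assumption: candidates are already ranked best-first and are compared to
    the target by exact string match (no canonicalisation is done here).
    """
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    if not targets:
        return 0.0
    hits = sum(
        1 for candidates, target in zip(predictions, targets)
        if target in candidates[:k]
    )
    return hits / len(targets)


# Toy data (hypothetical): three reactions, three ranked candidates each.
ranked_candidates = [
    ["CCO", "CCN", "CC=O"],          # target is ranked first
    ["c1ccccc1", "CC(=O)O", "CCO"],  # target is ranked second
    ["CC(C)=NO", "CC(C)=O", "CCO"],  # target is not among the candidates
]
reference_products = ["CCO", "CC(=O)O", "CCCC"]

top1 = top_k_accuracy(ranked_candidates, reference_products, k=1)
top3 = top_k_accuracy(ranked_candidates, reference_products, k=3)
print(f"Top-1: {top1:.3f}, Top-3: {top3:.3f}")
```

With greedy decoding only a single candidate is produced per reaction, so only Top-1 can be computed; Top-5 requires a decoding strategy that returns several ranked candidates, such as beam search.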

Paper Structure (6 sections, 7 figures, 3 tables)

Figures (7)

  • Figure 1: Illustration of the key areas explored along the flow of pretraining, fine-tuning and inference in our work.
  • Figure 2: Illustration of different preprocessing strategies for SMILES input.
  • Figure 3: The reaction mechanism to generate ketoxime from ketone and hydroxylamine.
  • Figure 4: Computed Shapley values for the reaction
  • Figure 5: Visualisation of the impact of tokens in the reactants and reagents on the first few tokens in the predicted product
  • ...and 2 more figures