Unifying Molecular and Textual Representations via Multi-task Language Modelling

Dimitrios Christofidellis, Giorgio Giannone, Jannis Born, Ole Winther, Teodoro Laino, Matteo Manica

TL;DR

This work addresses the lack of a unified representation between natural language and chemical language by introducing Text+Chem T5, a multi-task, multi-domain Transformer built on the T5 backbone. A single encoder shared across domains and a single decoder let the model translate between chemistry and language tasks without task-specific heads or expensive single-domain pretraining. Across forward reaction prediction, retrosynthesis, text-to-SMILES, SMILES-to-caption, and paragraph-to-action, the model compares favorably with state-of-the-art baselines, with the largest gains on cross-domain benchmarks and with gains that grow with scale and data augmentation. The approach supports end-to-end molecular discovery workflows and improves human–model interaction, offering a scalable route to accelerated discovery in the physical sciences.

Abstract

The recent advances in neural language models have also been successfully applied to the field of chemistry, offering generative solutions for classical problems in molecular design and synthesis planning. These new methods have the potential to fuel a new era of data-driven automation in scientific discovery. However, specialized models are still typically required for each task, leading to the need for problem-specific fine-tuning and neglecting task interrelations. The main obstacle in this field is the lack of a unified representation between natural language and chemical representations, complicating and limiting human-machine interaction. Here, we propose the first multi-domain, multi-task language model that can solve a wide range of tasks in both the chemical and natural language domains. Our model can handle chemical and natural language concurrently, without requiring expensive pre-training on single domains or task-specific models. Interestingly, sharing weights across domains remarkably improves our model when benchmarked against state-of-the-art baselines on single-domain and cross-domain tasks. In particular, sharing information across domains and tasks gives rise to large improvements in cross-domain tasks, the magnitude of which increases with scale, as measured by more than a dozen relevant metrics. Our work suggests that such models can robustly and efficiently accelerate discovery in physical sciences by superseding problem-specific fine-tuning and enhancing human-model interactions.
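
The multi-task setup summarized above can be pictured as casting every task, chemical or textual, into the same string-to-string format, so that one encoder-decoder model can be trained on the mixture without task-specific heads. The sketch below is only an illustration under stated assumptions: the prompt strings in TASK_PREFIXES, the Example type, and the helper names are hypothetical and are not taken from the paper, and the two records are toy data.

```python
from dataclasses import dataclass

# Hypothetical task prefixes, chosen for illustration only; the prompts
# actually used by the authors are not given in this section.
TASK_PREFIXES = {
    "forward": "Predict the product of the following reaction: ",
    "retrosynthesis": "Predict the reactants needed to synthesize: ",
    "text2mol": "Write in SMILES the described molecule: ",
    "mol2text": "Caption the following SMILES: ",
    "paragraph2actions": "List the actions described in the paragraph: ",
}


@dataclass(frozen=True)
class Example:
    """One training pair in the shared text-to-text format."""
    source: str
    target: str


def to_text_to_text(task: str, source: str, target: str) -> Example:
    """Prepend the task prefix so a single model can tell tasks apart."""
    if task not in TASK_PREFIXES:
        raise KeyError(f"unknown task: {task}")
    return Example(source=TASK_PREFIXES[task] + source, target=target)


def build_mixture(records):
    """Turn (task, input, output) triples into one multi-task training list."""
    return [to_text_to_text(task, src, tgt) for task, src, tgt in records]


# Toy records: an esterification (forward prediction) and a caption.
mixture = build_mixture([
    ("forward", "CC(=O)O.OCC", "CC(=O)OCC"),
    ("mol2text", "CCO", "The molecule is a primary alcohol."),
])
```

Because every example ends up as a plain (source, target) string pair, chemistry and language tasks can share one vocabulary, one encoder, and one decoder, which is the property the abstract attributes to the proposed model.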

Paper Structure (27 sections, 8 figures, 12 tables)

Figures (8)

  • Figure 1: Molecule to Caption task. This plot compares three model families at two sizes each (Text+Chem T5-base, Text+Chem T5-small, MolT5-base, MolT5-small, T5-base, and T5-small) on the task of converting SMILES to captions, using six metrics: BLEU-2, BLEU-4, ROUGE-1, ROUGE-2, ROUGE-L, and METEOR. Scores are plotted on the y-axis. Our proposal, Text+Chem T5, performs best on all metrics and improves with size, corroborating our hypothesis that joint learning on molecular and textual domains via multi-task learning is a powerful paradigm to bridge the gap between domains.
  • Figure 2: Description to Molecule task. This plot compares three model families at two sizes each (Text+Chem T5-base, Text+Chem T5-small, MolT5-base, MolT5-small, T5-base, and T5-small) on the task of converting captions to SMILES, using five metrics: Accuracy, Morgan FTS, RDK FTS, BLEU, and MACCS FTS. Scores are plotted on the y-axis. Our proposal, Text+Chem T5, performs best on all metrics and improves with size, corroborating our hypothesis that joint learning on molecular and textual domains via multi-task learning is a powerful paradigm to bridge the gap between domains.
  • Figure 3: Text+Chem T5 pipeline. Text+Chem T5 is a multi-task, multi-domain language model that integrates natural and chemical language. The model can solve language tasks, chemical tasks, and cross-domain tasks without task-specific fine-tuning or retraining. The chemical tasks are forward reaction prediction and retrosynthesis: forward reaction prediction predicts the outcome of a chemical reaction from the starting materials, and retrosynthesis predicts the starting materials required to synthesize a given compound. The cross-domain tasks are text-to-molecule (text-conditional de novo generation) and molecule-to-text (molecular captioning): in text-to-molecule, the model takes a textual description of a molecule as input and generates its SMILES representation; in molecule-to-text, the model takes a molecule represented as SMILES and generates a human-readable textual description. For the mono-domain language task, we focus on paragraph-to-action: given a paragraph describing how to build a molecule, the model outputs the actions required to obtain that result. The model leverages large, pre-trained single-domain models, such as T5 (Raffel et al., 2020), which serve as a good starting point for fine-tuning on the target distribution of tasks. Further variants of the Text+Chem T5 model explored in this work are shown in Figure 4.
  • Figure 4: The Chemical Language Model (CLM) family. Three approaches to building a multi-domain model for text and chemistry tasks. A: A multi-domain model is built without retraining the single-domain encoders (no enc-sharing, no enc-training). Instead, two frozen sets of weights ($\bar{\phi}_t$, $\bar{\phi}_c$) are used for the text and chemistry encoders, respectively. These weights are extracted from large, pre-trained language encoders, such as T5 (Raffel et al., 2020) and T5Chem (Lu and Zhang, 2022). B: A multi-domain model is still built using two sets of weights, but the chemistry encoder is fine-tuned (enc-training) while the text encoder remains frozen (no enc-sharing). Fine-tuning starts from a pre-trained T5 checkpoint (1.0) fine-tuned on chemistry data. C: The final, proposed Text+Chem T5 model. The encoders are merged into a joint encoder for text and chemistry ($\phi_t = \phi_c$), which is trained jointly on the multi-domain, multi-task data (enc-training, enc-sharing). This approach allows the model to be fine-tuned on a variety of tasks and domains, and the sharing of information between tasks and domains improves its generalization. A single T5 decoder is used, with no separate heads per task or domain. $V_t$ is the vocabulary for text and $V_c$ is the one for chemistry.
  • Figure 5: Molecule to Caption task. This plot compares three model families at two sizes each (Text+Chem T5-base, Text+Chem T5-small, MolT5-base, MolT5-small, T5-base, and T5-small) on the task of converting SMILES to captions, using six metrics: BLEU-2, BLEU-4, ROUGE-1, ROUGE-2, ROUGE-L, and METEOR. Scores are plotted on the y-axis. Our proposal, Text+Chem T5, performs best on all metrics and improves with size, corroborating our hypothesis that joint learning on molecular and textual domains via multi-task learning is a powerful paradigm to bridge the gap between domains.
  • ...and 3 more figures
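
The Morgan, RDK, and MACCS FTS metrics named in the Figure 2 caption can be read as fingerprint Tanimoto similarities between generated and reference molecules, assuming FTS carries its usual meaning from the molecule-captioning literature. A minimal sketch of the underlying computation follows; the function names are hypothetical, fingerprints are represented as plain sets of on-bit indices rather than being computed by a cheminformatics toolkit such as RDKit, and the value returned for two empty fingerprints is a convention chosen here, not something stated in the source.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto similarity |A ∩ B| / |A ∪ B| between two sets of on-bits."""
    a, b = set(bits_a), set(bits_b)
    union = a | b
    if not union:
        # Assumption: two empty fingerprints are treated as identical.
        return 1.0
    return len(a & b) / len(union)


def mean_fts(pairs):
    """Average Tanimoto similarity over (predicted, reference) fingerprint pairs."""
    scores = [tanimoto(pred, ref) for pred, ref in pairs]
    if not scores:
        raise ValueError("no fingerprint pairs given")
    return sum(scores) / len(scores)


# Toy fingerprints: bit indices written by hand, not derived from real molecules.
predicted = {1, 2, 3}
reference = {2, 3, 4}
similarity = tanimoto(predicted, reference)  # 2 shared bits out of 4 distinct bits
```

A score of 1 means the two fingerprints set exactly the same bits and 0 means they share none, so averaging over a test set gives one number per fingerprint type.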