Navigating Chemical-Linguistic Sharing Space with Heterogeneous Molecular Encoding

Liuzhenghao Lv, Hao Li, Yu Wang, Zhiyuan Yan, Zijun Chen, Zongying Lin, Li Yuan, Yonghong Tian

TL;DR

The Heterogeneous Molecular Encoding (HME) framework, a unified molecular encoder compressing the molecular features from fragment sequence, topology, and conformation with Q-learning, offers a new perspective on navigating chemical-linguistic sharing space, advancing the potential of CLMs in both fundamental research and practical applications in chemistry.

Abstract

Chemical language models (CLMs) are prominent for their effectiveness in exploring chemical space and enabling molecular engineering. However, when exploring chemical-linguistic space, CLMs suffer from the gap between natural language and molecular representations. This challenge stems primarily from the inherent modeling differences between molecules and text: molecules rely on unified modeling to learn chemical space, whereas natural language models the semantic space sequentially. The limited availability of high-quality text-to-molecule datasets further exacerbates this challenge. To address the problem, we first verified the information bias in molecular representations from different perspectives. We then developed the Heterogeneous Molecular Encoding (HME) framework, a unified molecular encoder that compresses molecular features from fragment sequence, topology, and conformation with Q-learning. To better model chemical-linguistic space, we further constructed the MCMoD dataset, which contains over one million molecules with various conditions, including properties, fragments, and descriptions. Experimentally, HME enables CLMs to explore the chemical-linguistic sharing space in two directions: (1) chemical space exploration with linguistic guidance, where HME achieves significant improvements (+8.9% FCD) for molecular design under multiple constraints, even in zero-shot scenarios; (2) linguistic space exploration with molecular guidance, where HME generates high-quality textual descriptions of molecules (+11.6% BLEU). These results highlight the precision of HME in handling multi-objective and cross-domain tasks, as well as its generalization capability on unseen task combinations. HME offers a new perspective on navigating chemical-linguistic sharing space, advancing the potential of CLMs in both fundamental research and practical applications in chemistry.

Paper Structure (13 sections, 12 equations, 4 figures, 2 tables)


Figures (4)

  • Figure 1: The Framework of our HME. (a) The alignment stage. We project features from different molecular encoders (i.e., tokenizers) into a unified space and align them with the textual space. (b) The task-specific supervised fine-tuning stage. Based on low-rank decomposition, we perform autoregressive generative modeling on a general transformer decoder, which can generate text, molecules, and fragments.
  • Figure 2: Experimental Analysis for Molecular Comprehension. (a) Captioning Task and General QA Task: Violin plots of the similarity scores between generated texts and reference texts for HME and the baselines. The area is positively correlated with the proportion of high-quality texts. B-2 denotes BLEU-2, R-1 denotes ROUGE-1, and M denotes METEOR. $\mu$ is the average score of all texts from HME, and $\mu^*$ is the average score of high-quality texts from HME. (b) Property QA Task: The distribution of the absolute error between predicted and ground-truth values for HME and the baselines. The error of HME is closer to zero, indicating strong molecular property prediction. The Root Mean Square Error (RMSE) is also reported.
  • Figure 3: Visualization for Conditional Molecular Generation of HME. (a) We visualize eight synthetic molecules designed by HME under the joint control of property values and specific fragments. $\alpha$ denotes the target property value, while $\beta$ denotes the actual value calculated by RDKit. The specified fragments are shown to the left of the dotted line and are also highlighted in the generated molecules. (b) Similarly, we visualize eight natural-product-like molecules designed by HME. These cases show that HME can design molecules with the specified property values and fragments. (c) We visualize protein ligands with potentially high affinity designed by HME. $\gamma$ and $\phi$ denote the docking scores predicted by QuickVINA2 and SMINA, respectively.
  • Figure 4: Experimental Analysis for Conditional Molecular Generation. (a) The distribution of the actual property values, calculated by RDKit, of the molecules generated under the control of the target property. Our model can effectively follow property controls. (b) Nine examples demonstrating the effectiveness of property control. With pyridine as the anchor fragment condition, we use different target property types and values as property conditions. $\alpha$ denotes the target property value, while $\beta$ denotes the actual value. (c) Statistics on the effectiveness of the fragment control condition, where $\gamma$ is the number of specified fragments and $\phi$ is the number of those fragments appearing in the generated molecule. (d) The generalization ability of HME in conditional molecular generation. L, Q, and S abbreviate LogP, QED, and SAS, respectively. $\delta$ is the proportion of molecules whose actual property values fall within $\pm 1$ or $\pm 0.1$ of the target property value. The molecules generated by HME in a zero-shot manner align well with the desired properties.
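
Two of the quantities named in the captions above, the RMSE reported for the Property QA task (Figure 2) and the proportion $\delta$ of generated molecules whose actual property value falls within a tolerance of the target (Figure 4), can be illustrated with a minimal Python sketch. This is not the authors' evaluation code: the function names `rmse` and `within_tolerance_rate` are hypothetical, the numbers are made up, and the tolerance is left to the caller because the caption does not say here which of $\pm 1$ or $\pm 0.1$ applies to which property.

```python
import math
from typing import Sequence


def rmse(predicted: Sequence[float], reference: Sequence[float]) -> float:
    """Root mean square error between predicted and ground-truth values."""
    if len(predicted) != len(reference) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(p - r) ** 2 for p, r in zip(predicted, reference)]
    return math.sqrt(sum(squared) / len(squared))


def within_tolerance_rate(
    actual: Sequence[float], target: float, tolerance: float
) -> float:
    """Fraction of actual values lying in [target - tolerance, target + tolerance].

    Hypothetical helper mirroring the caption's description of delta.
    """
    if not actual:
        raise ValueError("actual must be non-empty")
    hits = sum(1 for value in actual if abs(value - target) <= tolerance)
    return hits / len(actual)


# Made-up property predictions and ground truth (illustration only).
predicted = [1.0, 2.0, 3.0]
reference = [1.0, 2.0, 5.0]
print(rmse(predicted, reference))

# Made-up property values of four generated molecules, target value 2.0.
property_values = [1.5, 2.5, 3.5, 4.5]
print(within_tolerance_rate(property_values, target=2.0, tolerance=1.0))
```

In a real pipeline the actual property values would come from a cheminformatics toolkit such as RDKit, as the captions state; they are hard-coded here so that the sketch stays self-contained.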