MolTC: Towards Molecular Relational Modeling In Language Models

Junfeng Fang, Shuai Zhang, Chang Wu, Zhengyi Yang, Zhiyuan Liu, Sihang Li, Kun Wang, Wenjie Du, Xiang Wang

TL;DR

MolTC addresses the underutilization of molecular graph structure in LLM-based molecular relational learning by unifying graph-based encodings with a decoder-only LLM under Chain-of-Thought reasoning. It introduces a Graph Encoder–Representation Projector–SMILES Injector architecture that feeds into the Galactica backbone, guided by a multi-hierarchical CoT training paradigm and the MoT-instruction dataset. The method achieves superior performance across 12 datasets and over 4,000,000 molecular pairs, outperforming both GNN and LLM baselines on qualitative and quantitative interaction tasks. This work demonstrates the value of integrating graph-structured information with CoT-guided LLM reasoning for robust molecular interaction prediction and lays groundwork for shared, multimodal biochemical reasoning resources.
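The data flow named above (graph encoder, representation projector, SMILES injection into the LLM prompt) can be pictured with a small toy sketch. It is not the authors' implementation. The mean-pooling encoder, the random linear projector, the `<GRAPH_A>`/`<GRAPH_B>` placeholders, the prompt wording, and all function names are hypothetical stand-ins chosen for illustration. The sketch also assumes both molecules pass through the same encoder and projector.

```python
import random
from typing import List

Vector = List[float]

GRAPH_DIM = 2   # toy per-atom feature width (assumption)
LLM_DIM = 4     # toy LLM input width (assumption)


def encode_graph(atom_features: List[Vector]) -> Vector:
    """Toy stand-in for a graph encoder: mean-pool per-atom features."""
    n = len(atom_features)
    dim = len(atom_features[0])
    return [sum(atom[i] for atom in atom_features) / n for i in range(dim)]


def project(embedding: Vector, weights: List[Vector]) -> Vector:
    """Toy linear projector: map a graph embedding to the LLM input width."""
    return [sum(w * x for w, x in zip(row, embedding)) for row in weights]


def build_prompt(smiles_a: str, smiles_b: str) -> str:
    """Hypothetical prompt: graph placeholders with SMILES strings injected."""
    return (
        f"The first molecule <GRAPH_A> has SMILES {smiles_a}. "
        f"The second molecule <GRAPH_B> has SMILES {smiles_b}. "
        "Describe the interaction between them."
    )


# One weight matrix reused for both molecules (shared-parameter assumption).
rng = random.Random(0)
shared_weights = [[rng.uniform(-1, 1) for _ in range(GRAPH_DIM)]
                  for _ in range(LLM_DIM)]

# Toy one-hot atom features [is_carbon, is_oxygen] for ethanol and water.
ethanol_atoms = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
water_atoms = [[0.0, 1.0]]

token_a = project(encode_graph(ethanol_atoms), shared_weights)
token_b = project(encode_graph(water_atoms), shared_weights)
prompt = build_prompt("CCO", "O")

print(prompt)
print(len(token_a), len(token_b))
```

In a real system the projected vectors would replace the placeholder positions in the LLM's input embeddings. Here they are only computed to show where each component sits in the pipeline.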

Abstract

Molecular Relational Learning (MRL), which aims to understand interactions between molecular pairs, plays a pivotal role in advancing biochemical research. Recently, the adoption of large language models (LLMs), known for their vast knowledge repositories and advanced logical inference capabilities, has emerged as a promising approach to efficient and effective MRL. Despite their potential, these methods rely predominantly on textual data and thus do not fully harness the wealth of structural information inherent in molecular graphs. Moreover, the absence of a unified framework exacerbates this underutilization of information, as it hinders the sharing of interaction mechanisms learned across diverse datasets. To address these challenges, this work proposes a novel LLM-based multi-modal framework for Molecular inTeraction prediction following Chain-of-Thought (CoT) theory, termed MolTC, which effectively integrates the graph information of the two molecules in a pair. To train MolTC efficiently, we introduce a Multi-hierarchical CoT concept to refine its training paradigm, and construct a comprehensive Molecular Interactive Instructions dataset for the development of biochemical LLMs involving MRL. Our experiments, conducted across various datasets involving over 4,000,000 molecular pairs, demonstrate the superiority of our method over current GNN and LLM-based baselines. Code is available at https://github.com/MangoKiller/MolTC.

Paper Structure

This paper contains 24 sections, 5 equations, 2 figures, 5 tables.

Figures (2)

  • Figure 1: Comparison between the current methods leveraging LLMs to model molecule interactions and our MolTC. (a) The prevailing paradigm of current methods. (b) The challenge of applying the current paradigm to the tasks involving datasets with a small number of samples. (c) The framework of our proposed MolTC, which is enhanced by the principle of CoT. Best viewed in color.
  • Figure 2: The training process of our MolTC. The flame symbol denotes the parameter update, the snowflake symbol indicates the parameter freezing, and the chain symbol depicts the parameter sharing between two modules. Best viewed in color.