Molecular Graph Representation Learning Integrating Large Language Models with Domain-specific Small Models

Tianyu Zhang, Yuxiang Ren, Chengbin Hou, Hairong Lv, Xuegong Zhang

TL;DR

This work addresses molecular property prediction by leveraging the complementary strengths of Large Language Models (LLMs) and Domain-specific Small Models (DSMs). It introduces MolGraph-LarDo, which employs a two-stage prompt strategy to calibrate LLM-generated domain knowledge with DSM-derived metrics, and a multi-modal graph-text alignment to guide graph pre-training. The approach yields superior performance on MoleculeNet benchmarks compared to both supervised and pre-trained baselines, while mitigating hallucinations and reducing the need for extensive domain expertise. The framework offers scalable, knowledge-efficient molecular representation learning with practical impact for drug discovery workflows, and the accompanying code enhances reproducibility.

Abstract

Molecular property prediction is a crucial foundation for drug discovery. In recent years, pre-trained deep learning models have been widely applied to this task. Some approaches that incorporate prior biological domain knowledge into the pre-training framework have achieved impressive results. However, these methods heavily rely on biochemical experts, and retrieving and summarizing vast amounts of domain knowledge literature is both time-consuming and expensive. Large Language Models (LLMs) have demonstrated remarkable performance in understanding and efficiently providing general knowledge. Nevertheless, they occasionally exhibit hallucinations and lack precision in generating domain-specific knowledge. Conversely, Domain-specific Small Models (DSMs) possess rich domain knowledge and can accurately calculate molecular domain-related metrics. However, due to their limited model size and singular functionality, they lack the breadth of knowledge necessary for comprehensive representation learning. To leverage the advantages of both approaches in molecular property prediction, we propose a novel Molecular Graph representation learning framework that integrates Large language models and Domain-specific small models (MolGraph-LarDo). Technically, we design a two-stage prompt strategy where DSMs are introduced to calibrate the knowledge provided by LLMs, enhancing the accuracy of domain-specific information and thus enabling LLMs to generate more precise textual descriptions for molecular samples. Subsequently, we employ a multi-modal alignment method to coordinate various modalities, including molecular graphs and their corresponding descriptive texts, to guide the pre-training of molecular representations. Extensive experiments demonstrate the effectiveness of the proposed method.
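The two-stage prompt strategy described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the prompt wording, the function names (`build_dataset_prompt`, `build_sample_prompt`, `describe_molecule`), and the `toy_dsm` metric calculator are all hypothetical, and the LLM is represented by any callable that maps a prompt string to a response string. In a real pipeline the stage-one template would presumably be generated once per dataset and reused across samples.

```python
from typing import Callable, Dict

LLM = Callable[[str], str]               # hypothetical: prompt in, text out
DSM = Callable[[str], Dict[str, float]]  # hypothetical: SMILES in, metrics out


def toy_dsm(smiles: str) -> Dict[str, float]:
    """Stand-in for a domain-specific small model (assumption).

    These are crude character counts, NOT real chemistry; a real DSM
    would compute proper molecular metrics.
    """
    return {
        "oxygen_count": float(smiles.count("O")),
        "uppercase_symbol_count": float(sum(ch.isupper() for ch in smiles)),
    }


def build_dataset_prompt(dataset: str, task: str) -> str:
    """Stage 1: ask the LLM for a dataset-level description template."""
    return (
        f"Dataset: {dataset}. Task: {task}.\n"
        "List the molecular properties most relevant to this task and "
        "write a description template with one slot per property."
    )


def build_sample_prompt(template: str, smiles: str,
                        metrics: Dict[str, float]) -> str:
    """Stage 2: ask the LLM to fill the template using DSM metrics."""
    metric_lines = "\n".join(
        f"- {name}: {value:g}" for name, value in sorted(metrics.items())
    )
    return (
        f"Template:\n{template}\n\n"
        f"Molecule (SMILES): {smiles}\n"
        f"Computed metrics:\n{metric_lines}\n\n"
        "Fill in the template using only the metrics listed above."
    )


def describe_molecule(llm: LLM, dsm: DSM, dataset: str,
                      task: str, smiles: str) -> str:
    """Run both stages and return the molecule's descriptive text."""
    template = llm(build_dataset_prompt(dataset, task))
    return llm(build_sample_prompt(template, smiles, dsm(smiles)))


def echo_llm(prompt: str) -> str:
    """Placeholder LLM so the sketch runs without any external service."""
    return f"[LLM response to a {len(prompt)}-character prompt]"


if __name__ == "__main__":
    print(describe_molecule(echo_llm, toy_dsm, "BBBP",
                            "blood-brain barrier penetration", "CCO"))
```

The point of the design is that numeric values come from the DSM rather than from the LLM, so the LLM only has to verbalize them; that is the calibration role the abstract assigns to DSMs.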

Paper Structure (26 sections, 3 equations, 4 figures, 2 tables)

This paper contains 26 sections, 3 equations, 4 figures, 2 tables.

Figures (4)

  • Figure 1: Overview of the proposed framework. (a) Two-stage Prompt Strategy; (b) Molecular Graph-Text Alignment.
  • Figure 2: A detailed case of the two-stage prompt strategy on the BBBP dataset. Left: Dataset-specific prompt for generating the Molecular Description Template (MD-Template); Right: Sample-specific prompt for generating the Molecular Description Text (MD-Text).
  • Figure 3: Case study of the prompt and MD-Text on the FreeSolv dataset for three versions of MolGraph-LarDo.
  • Figure 4: Results of ablation experiments for the graph-text alignment.