From Artificially Real to Real: Leveraging Pseudo Data from Large Language Models for Low-Resource Molecule Discovery

Yuhan Chen, Nuwa Xi, Yanrui Du, Haochun Wang, Jianyu Chen, Sendong Zhao, Bing Qin

TL;DR

To address data scarcity in cross-modal molecule discovery, this paper uses artificially-real (pseudo) data generated by Large Language Models (LLMs): it first introduces a retrieval-based prompting strategy to construct high-quality pseudo data, then explores how best to leverage that data.

Abstract

Molecule discovery serves as a cornerstone in numerous scientific domains, fueling the development of new materials and innovative drug designs. Recent developments in in-silico molecule discovery have highlighted the promising results of cross-modal techniques, which bridge molecular structures with their descriptive annotations. However, these cross-modal methods frequently encounter the issue of data scarcity, hampering their performance and application. In this paper, we address the low-resource challenge by utilizing artificially-real data generated by Large Language Models (LLMs). We first introduce a retrieval-based prompting strategy to construct high-quality pseudo data, then explore the optimal method to effectively leverage this pseudo data. Experiments show that using pseudo data for domain adaptation outperforms all existing methods, while also requiring a smaller model scale, reduced data size and lower training cost, highlighting its efficiency. Furthermore, our method shows a sustained improvement as the volume of pseudo data increases, revealing the great potential of pseudo data in advancing low-resource cross-modal molecule discovery. Our code and data are available at https://github.com/SCIR-HI/ArtificiallyR2R.

Paper Structure (27 sections, 6 figures, 4 tables)

Figures (6)

  • Figure 1: Illustration of translation between molecule and description in cross-modal molecule discovery.
  • Figure 2: The workflow for pseudo data generation. Starting with an unlabeled molecule represented by its Morgan Fingerprints, two stages are involved. In stage 1, the input molecule serves as a search query to retrieve the top-k similar molecules from a local database containing 37,898 annotated molecule-caption pairs. In stage 2, the retrieved molecules and their captions are integrated into a prompt. Then LLMs perform in-context learning and generate a description for the input molecule.
  • Figure 3: Comparison of data quality. We use the method proposed by Edwards et al. (2022) to evaluate the similarity between molecule-description pairs as an estimate of data quality. The distribution is visualized using Kernel Density Estimation. A higher Text2Mol score signifies closer molecule-description resemblance, and "Density" represents the data concentration in a given region.
  • Figure 4: Different methods for utilizing pseudo data. Traditional training employs only the real dataset for fine-tuning. The data augmentation approach fine-tunes the model on the real dataset combined with pseudo data. In the domain adaptation method, the model is (1) first pre-trained on two concurrent cross-modal translation tasks using pseudo data, which serves as domain adaptation, and (2) then further trained on each task using real data.
  • Figure 5: Results on the molecular captioning task using different amounts of pseudo data.
  • ...and 1 more figures
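
The two-stage workflow described in the Figure 2 caption can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: fingerprints are represented as plain sets of on-bit indices (in practice they would come from a cheminformatics toolkit), Tanimoto similarity is assumed as the retrieval metric, and the prompt wording, the `Entry` type, and the function names are all hypothetical.

```python
from typing import List, Set, Tuple

# Hypothetical record type: (SMILES string, fingerprint on-bits, caption).
Entry = Tuple[str, Set[int], str]


def tanimoto(a: Set[int], b: Set[int]) -> float:
    """Tanimoto similarity of two fingerprints given as sets of on-bit indices."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def retrieve_top_k(query_fp: Set[int], database: List[Entry], k: int) -> List[Entry]:
    """Stage 1: return the k database entries most similar to the query."""
    ranked = sorted(database, key=lambda e: tanimoto(query_fp, e[1]), reverse=True)
    return ranked[:k]


def build_prompt(query_smiles: str, examples: List[Entry]) -> str:
    """Stage 2: place retrieved pairs before the query as in-context examples."""
    parts = [f"Molecule: {s}\nDescription: {c}" for s, _, c in examples]
    # The query molecule comes last, with its description left for the LLM.
    parts.append(f"Molecule: {query_smiles}\nDescription:")
    return "\n\n".join(parts)


# Toy database; fingerprints and captions are made up for illustration.
database: List[Entry] = [
    ("CCO", {1, 2, 3}, "A primary alcohol."),
    ("CCN", {1, 2, 4}, "A primary amine."),
    ("c1ccccc1", {7, 8, 9}, "An aromatic hydrocarbon."),
]

examples = retrieve_top_k({1, 2, 3, 5}, database, k=2)
prompt = build_prompt("CCCO", examples)
print(prompt)
```

The resulting prompt string would then be sent to an LLM, whose completion serves as the pseudo description for the unlabeled molecule. Keeping retrieval and prompt construction as separate functions makes it easy to swap in a different similarity metric or prompt template.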