Automated Molecular Concept Generation and Labeling with Large Language Models

Zimin Zhang, Qianli Wu, Botao Xia, Fang Sun, Ziniu Hu, Yizhou Sun, Shichang Zhang

TL;DR

This work introduces AutoMolCo, an automated framework that uses Large Language Models to generate and label predictive molecular concepts, enabling simple, explainable predictors to match or surpass GNNs and LLM in-context learning on MoleculeNet and High-Throughput Experimentation datasets. It comprises concept generation, labeling via direct prompts, generated code, or external tools, and iterative refinement to improve concepts and predictions. Across diverse tasks, AutoMolCo achieves strong performance while providing interpretable explanations through linear coefficients, decision-tree splits, and targeted concept interventions. The approach reduces reliance on human domain knowledge and highlights the potential of LLM-driven explainability in molecular science, though it depends on LLM quality and requires careful handling of labeling challenges and potential hallucinations.

Abstract

Artificial intelligence (AI) is transforming scientific research, with explainable AI methods like concept-based models (CMs) showing promise for new discoveries. However, in molecular science, CMs are less common than black-box models like Graph Neural Networks (GNNs), due to their need for predefined concepts and manual labeling. This paper introduces the Automated Molecular Concept (AutoMolCo) framework, which leverages Large Language Models (LLMs) to automatically generate and label predictive molecular concepts. Through iterative concept refinement, AutoMolCo enables simple linear models to outperform GNNs and LLM in-context learning on several benchmarks. The framework operates without human knowledge input, overcoming limitations of existing CMs while maintaining explainability and allowing easy intervention. Experiments on MoleculeNet and High-Throughput Experimentation (HTE) datasets demonstrate that AutoMolCo-induced explainable CMs are beneficial for molecular science research.

Paper Structure

This paper contains 27 sections, 14 figures, and 8 tables.

Figures (14)

  • Figure 1: The prediction process of molecule properties is greatly illuminated with AutoMolCo. First, concepts are generated and labeled with an LLM. Then, a simple prediction model (e.g., linear regression) is fitted to achieve explainable predictions. LLM-relevant pieces are highlighted in green.
  • Figure 2: The AutoMolCo framework. Step 1: concept generation. Step 2: concept labeling with three different strategies. Step 3: fitting a prediction model and performing concept selection. These three steps are repeated for multiple iterations to obtain a refined list of meaningful concepts, with the concepts selected in each iteration fed back to the LLM through prompting. LLM outputs are highlighted as green boxes.
  • Figure 3: RQ1: Concepts selected by AutoMolCo in three refinement iterations on FreeSolv. A detailed version is given in the Appendix.
  • Figure 4: RQ4: Iterative refinement improves CM performance for (a) regression and (b) classification tasks; (c) RQ5: Coefficients of the linear regression model from AutoMolCo after 3 iterations on FreeSolv.
  • Figure 5: Prompts for generating concept labeling functions in Python code for the FreeSolv dataset.
  • ...and 9 more figures
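
The generate, label, fit, and select loop described in the Figure 2 caption can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: `toy_propose` and `toy_label` are hypothetical stand-ins for the LLM calls, the molecules and targets are synthetic, and ranking concepts by the absolute value of their linear coefficient is an assumed selection rule. The least-squares fit uses only the standard library.

```python
from typing import Callable, List, Sequence, Tuple


def solve(a: List[List[float]], b: List[float]) -> List[float]:
    """Solve a @ x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                for c in range(col, n + 1):
                    m[r][c] -= factor * m[col][c]
    return [m[i][n] / m[i][i] for i in range(n)]


def fit_linear(X: Sequence[Sequence[float]], y: Sequence[float]) -> List[float]:
    """Ordinary least squares via the normal equations.

    Returns [intercept, w_1, ..., w_d]. Assumes the columns of X (plus the
    intercept) are linearly independent.
    """
    rows = [[1.0] + list(x) for x in X]
    d = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(d)] for i in range(d)]
    xty = [sum(r[i] * t for r, t in zip(rows, y)) for i in range(d)]
    return solve(xtx, xty)


def refine(
    molecules: Sequence[str],
    targets: Sequence[float],
    propose: Callable[[List[str]], List[str]],
    label: Callable[[str, str], float],
    n_iter: int = 3,
    keep: int = 2,
) -> List[Tuple[str, float]]:
    """Run the three-step loop and return the kept (concept, coefficient) pairs."""
    ranked: List[Tuple[str, float]] = []
    selected: List[str] = []
    for _ in range(n_iter):
        concepts = propose(selected)                               # Step 1: generation
        X = [[label(c, m) for c in concepts] for m in molecules]   # Step 2: labeling
        weights = fit_linear(X, targets)[1:]                       # Step 3: fit
        ranked = sorted(zip(concepts, weights), key=lambda cw: abs(cw[1]), reverse=True)
        ranked = ranked[:keep]                                     # assumed selection rule
        selected = [c for c, _ in ranked]                          # feedback for next round
    return ranked


# Hypothetical stand-ins for the LLM: a fixed concept pool and string-based labels.
POOL = ["length", "num_N", "num_O"]


def toy_propose(selected: List[str]) -> List[str]:
    fresh = [c for c in POOL if c not in selected]
    return list(selected) + fresh[:2]


def toy_label(concept: str, smiles: str) -> float:
    if concept == "length":
        return float(len(smiles))
    return float(smiles.count(concept[-1]))  # "num_O" counts 'O', "num_N" counts 'N'


# Synthetic data: the target depends only on the O and N counts.
molecules = ["CCO", "CCN", "CCCC", "OCCO", "NCCN", "CCCO"]
targets = [1.0 - 2.0 * s.count("O") - 1.0 * s.count("N") for s in molecules]

result = refine(molecules, targets, toy_propose, toy_label)
for name, weight in result:
    print(f"{name}: {weight:+.3f}")
```

The returned coefficients play the role of the explanation: each kept concept comes with a signed weight that can be read directly, in the spirit of the coefficient plot described for Figure 4(c). In a real setting, `toy_label` would be replaced by one of the three labeling strategies (direct prompting, generated code, or an external tool).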