A quantitative analysis of knowledge-learning preferences in large language models in molecular science

Pengfei Liu, Jun Tao, Zhixiang Ren

TL;DR

The paper introduces ChEBI-20-MM, a comprehensive multi-modal benchmark for assessing large language models in molecular science across internal modalities (SMILES, InChI, SELFIES, graphs) and external modalities (captions, IUPAC names, images). It deploys a modal transition probability matrix and a statistically interpretable localized feature filtering approach to uncover knowledge-learning preferences, supported by 1263 experiments. Key findings show that T5-based encoders excel in text-driven tasks, that certain modalities are best suited for captioning or retrieval, and that the proposed token-mapping analysis yields interpretable chemical knowledge mappings (e.g., IUPAC-to-caption, SELFIES-to-caption). The framework offers a principled path to optimize modality selection, model architectures, and training strategies for molecular SLM applications, with broad implications for molecular design and discovery.

Abstract

Deep learning has significantly advanced molecular modeling and design, enabling efficient understanding and discovery of novel molecules. In particular, large language models (LLMs) introduce a fresh research paradigm to tackle scientific problems from a natural language processing (NLP) perspective. LLMs significantly enhance our understanding and generation of molecules, often surpassing existing methods with their capabilities to decode and synthesize complex molecular patterns. However, two key issues remain: how to quantify the match between model and data modalities and how to identify the knowledge-learning preferences of models. To address these challenges, we propose a multi-modal benchmark, named ChEBI-20-MM, and perform 1263 experiments to assess the model's compatibility with data modalities and knowledge acquisition. Through the modal transition probability matrix, we provide insights into the most suitable modalities for tasks. Furthermore, we introduce a statistically interpretable approach to discover context-specific knowledge mapping by localized feature filtering. Our analysis offers an exploration of the learning mechanism and paves the way for advancing LLMs in molecular science.

Paper Structure (5 sections, 5 equations, 9 figures, 12 tables)

Figures (9)

  • Figure 1: The paradigm of the analysis. a. Molecular modeling and design tasks, showcasing six task types with their standard modeling methods and data examples. b. The paradigms of tasks. We divide common molecular data into two categories: internal and external information. Internal information, integral to molecular representation, can be converted through various tools. External information is more accessible to human understanding. Additionally, this part highlights the research scope of our analysis, detailing the input and output for each task.
  • Figure 2: Results of the benchmark. a. Modal transition probability matrix. This matrix presents the performance in text generation and property prediction tasks. The vertical axis represents input modalities, while the horizontal axis denotes output modalities. b. Encoders and decoders in nine text-to-text tasks. This illustration shows how often each model appears in the top 5 rankings. The T5-based models exhibit a dominant presence. c. Encoders, pooling mechanisms, and retrieval performance in embedding tasks. Alongside model rankings, the figure indicates that average pooling is a preferred choice for the pooling layer.
  • Figure 3: Knowledge patterns and insights. a. Token mapping matrix and threshold $T$ analysis. The two matrices represent the high-frequency token mapping patterns produced when generating molecular captions from IUPAC names and from SELFIES. On the right of the figure, as the threshold $T$ increases, the selection criteria for identifying specific high-frequency token pairs become more stringent, consequently reducing their number and impacting the significance levels. b. Case studies of knowledge-learning preferences. These cases are selected from model inference results, where the mapping of tokens exemplifies the model's preferences for knowledge learning.
  • Figure 4: Performance of multi-modal fusion. a. Multi-modal fusion performance for molecular property predictions. This figure displays the AUC-ROC results for various molecular property prediction classification tasks. It compares the performance of the SciBERT and MolT5 models as encoders using SMILES (S) as input text and BioT5 using SELFIES as input text, each combined with the graph model (GIN) utilizing graph data (G). Each subplot shows the final results obtained from the vectors produced by encoding and pooling with the foundation models. "add" represents vector addition, "weight_add" represents adaptive weighted vector addition, "concat" represents concatenated encoding followed by pooling, and "attention" represents concatenated encoding processed by the attention mechanism before pooling. Different colors represent different tasks. b. Multi-modal fusion performance for molecule captioning. This panel shows the performance of six textual similarity metrics across two common datasets for molecule captioning tasks. The x-axis represents the models and input modalities, while the y-axis represents the metric values. Each color corresponds to a different metric.
  • Figure 5: An overview of model tasks and architectures. a. Tasks and models. It clarifies the relationship between six downstream tasks and model architectures. b. Encoder-decoder model architectures. It delineates three main frameworks: (1) text-text is primarily focused on text translation tasks; (2) graph-text is predominantly used in contrastive learning frameworks and serves as an encoder for downstream tasks; (3) image-text is chiefly applied in molecular image recognition tasks.
  • ...and 4 more figures
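
The four fusion strategies named in the Figure 4 caption ("add", "weight_add", "concat", "attention") can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the use of mean pooling, the sigmoid-gated scalar weight, and the projection-free single-head attention are all assumptions made for illustration, and both encoders are assumed to emit vectors of the same dimension.

```python
import math

def mean_pool(vectors):
    """Average a sequence of equal-length vectors into one vector."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def fuse_add(text_vec, graph_vec):
    """'add': element-wise sum of the two pooled vectors."""
    return [t + g for t, g in zip(text_vec, graph_vec)]

def fuse_weight_add(text_vec, graph_vec, logit=0.0):
    """'weight_add': weighted sum of the two pooled vectors.

    `logit` is a hypothetical scalar that would be learned during
    training; the sigmoid keeps the text weight inside (0, 1).
    """
    w = 1.0 / (1.0 + math.exp(-logit))
    return [w * t + (1.0 - w) * g for t, g in zip(text_vec, graph_vec)]

def fuse_concat(text_tokens, graph_nodes):
    """'concat': join the two encoded sequences, then pool."""
    return mean_pool(text_tokens + graph_nodes)

def fuse_attention(text_tokens, graph_nodes):
    """'attention': join the sequences, apply self-attention, then pool.

    Assumption: one head of scaled dot-product attention with no
    learned query/key/value projections.
    """
    seq = text_tokens + graph_nodes
    dim = len(seq[0])
    attended = []
    for query in seq:
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
                  for key in seq]
        top = max(scores)  # subtract the max for numerical stability
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        attended.append([sum(w * value[j] for w, value in zip(weights, seq))
                         for j in range(dim)])
    return mean_pool(attended)

# Toy inputs: two text-token vectors and one graph-node vector.
text_tokens = [[1.0, 0.0], [3.0, 2.0]]
graph_nodes = [[0.0, 4.0]]
text_vec = mean_pool(text_tokens)
graph_vec = mean_pool(graph_nodes)

fused = {
    "add": fuse_add(text_vec, graph_vec),
    "weight_add": fuse_weight_add(text_vec, graph_vec),
    "concat": fuse_concat(text_tokens, graph_nodes),
    "attention": fuse_attention(text_tokens, graph_nodes),
}
```

In this sketch, "add" and "weight_add" operate on already-pooled vectors, while "concat" and "attention" operate on the token-level and node-level sequences before pooling, which is the distinction the caption draws.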