
The limits of bio-molecular modeling with large language models: a cross-scale evaluation

Yaxin Xu, Yue Zhou, Tianyu Zhao, Fengwei An, Zhixiang Ren

Abstract

The modeling of bio-molecular systems across molecular scales remains a central challenge in scientific research. Large language models (LLMs) are increasingly applied to bio-molecular discovery, yet systematic evaluation across multi-scale biological problems and rigorous assessment of their tool-augmented capabilities remain limited. We reveal a systematic gap between LLM performance and mechanistic understanding through BioMol-LLM-Bench, a proposed cross-scale bio-molecular benchmark: a unified framework comprising 26 downstream tasks spanning 4 distinct difficulty levels, with integrated computational tools for more comprehensive evaluation. Evaluation of 13 representative models reveals 4 main findings: chain-of-thought data provides limited benefit and may even reduce performance on biological tasks; hybrid Mamba-attention architectures are more effective for long bio-molecular sequences; supervised fine-tuning improves specialization at the cost of generalization; and current LLMs perform well on classification tasks but remain weak on challenging regression tasks. Together, these findings provide practical guidance for future LLM-based modeling of molecular systems.


Paper Structure

This paper contains 18 sections, 6 figures, and 3 tables.

Figures (6)

  • Figure 1: Overview of BioMol-LLM-Bench. The framework addresses limitations of existing benchmarks by incorporating specialized bio-chemistry tools and enabling cross-scale evaluation with 26 tasks across 4 levels. The evaluation pipeline integrates automatic answer extraction and metrics computation (a minimal extraction sketch follows this list). 13 models are compared under different experimental settings.
  • Figure 2: Performance of 13 LLMs across 8 classification tasks. Prediction accuracy is displayed as bars against the left axis; markers represent model output validity against the right axis. Larger-parameter models rank among the top performers in overall accuracy.
  • Figure 3: Distributions of prediction errors on the 11 regression tasks. For each task, the difference between each model's predictions and the ground-truth labels is computed across samples; the mean and variance of these differences yield the boxplot in each subplot (a computation sketch follows this list).
  • Figure 4: Model performance on generation tasks. Higher values indicate better performance. a). For the MOL_Syn and MOL_Resyn tasks, fingerprint similarities between predictions and references are computed with three fingerprint types: MACCS, Morgan, and RDKit (a similarity sketch follows this list). These values are displayed as bars against the left y-axis; markers representing output validity correspond to the right y-axis. b). PROT_GO and DDI_Interact are evaluated with Levenshtein similarity and the METEOR metric, respectively.
  • Figure 5: LLM performance with and without tool integration across 4 representative tasks. (left) For regression tasks (MOL_Solubility and MOL_Freesolv), performance is evaluated with RMSE, where lower bars indicate better performance; dashed bars mark values exceeding the axis range, with the actual value printed in white. (right) For classification tasks, performance is measured by accuracy; markers represent model output validity.
  • ...and 1 more figure
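
Figures 1 and 2 refer to automatic answer extraction and output validity. A minimal sketch of how such a step could be implemented is shown below; the regex patterns and the `extract_answer` / `validity` helpers are hypothetical illustrations, not the benchmark's actual pipeline.

```python
import re
from typing import Optional

# Hypothetical answer-extraction step: pull a final choice or numeric value
# out of free-form LLM output. The patterns below are illustrative only.
CHOICE_PATTERN = re.compile(r"(?:answer|prediction)\s*[:=]?\s*([A-D]|yes|no)\b", re.IGNORECASE)
NUMBER_PATTERN = re.compile(r"[-+]?\d*\.?\d+(?:[eE][-+]?\d+)?")

def extract_answer(output: str, task_type: str) -> Optional[str]:
    """Return the extracted answer, or None if the output is invalid."""
    if task_type == "classification":
        match = CHOICE_PATTERN.search(output)
        return match.group(1).lower() if match else None
    if task_type == "regression":
        numbers = NUMBER_PATTERN.findall(output)
        return numbers[-1] if numbers else None  # take the last number stated
    return None

# Validity (the markers in Figures 2 and 5) is then the fraction of outputs
# from which any answer could be extracted at all.
def validity(outputs: list[str], task_type: str) -> float:
    parsed = [extract_answer(o, task_type) for o in outputs]
    return sum(a is not None for a in parsed) / len(outputs)
```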
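Figure 3's per-task statistics follow a simple recipe. The sketch below, assuming aligned NumPy arrays of predictions and labels, illustrates the described mean/variance computation and boxplot; it is a generic reconstruction, not the authors' exact code.

```python
import numpy as np
import matplotlib.pyplot as plt

# Illustrative reconstruction of the Figure 3 procedure: for each model,
# compute prediction - label differences on one regression task, then
# summarize their distribution with a boxplot.
def error_boxplot(predictions_by_model: dict[str, np.ndarray],
                  labels: np.ndarray, task_name: str) -> None:
    errors = {m: p - labels for m, p in predictions_by_model.items()}
    for model, e in errors.items():
        print(f"{task_name} / {model}: mean={e.mean():.3f}, var={e.var():.3f}")
    plt.boxplot(list(errors.values()), labels=list(errors.keys()))
    plt.ylabel("prediction - ground truth")
    plt.title(task_name)
    plt.show()
```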
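For the fingerprint similarities of Figure 4a, a standard RDKit-based computation looks like the following. This generic Tanimoto-similarity sketch assumes SMILES inputs and the usual fingerprint defaults (Morgan radius 2, 2048 bits); it is not the paper's evaluation code.

```python
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem, MACCSkeys

# Generic fingerprint Tanimoto similarity between a predicted and a
# reference molecule, for the three fingerprint types named in Figure 4a.
def fingerprint_similarities(pred_smiles: str, ref_smiles: str):
    pred, ref = Chem.MolFromSmiles(pred_smiles), Chem.MolFromSmiles(ref_smiles)
    if pred is None or ref is None:  # unparsable SMILES -> invalid output
        return None
    pairs = {
        "MACCS":  (MACCSkeys.GenMACCSKeys(pred), MACCSkeys.GenMACCSKeys(ref)),
        "MORGAN": (AllChem.GetMorganFingerprintAsBitVect(pred, 2, nBits=2048),
                   AllChem.GetMorganFingerprintAsBitVect(ref, 2, nBits=2048)),
        "RDKIT":  (Chem.RDKFingerprint(pred), Chem.RDKFingerprint(ref)),
    }
    return {name: DataStructs.TanimotoSimilarity(a, b)
            for name, (a, b) in pairs.items()}
```

Returning None for unparsable SMILES matches the caption's distinction between similarity scores (bars) and output validity (markers): a prediction that cannot be parsed contributes to invalidity rather than to the similarity average.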