Benchmarking Large Language Models on Multiple Tasks in Bioinformatics NLP with Prompting

Jiyue Jiang, Pengan Chen, Jiuming Wang, Dongchen He, Ziqin Wei, Liang Hong, Licheng Zong, Sheng Wang, Qinze Yu, Zixian Ma, Yanyu Chen, Yimin Fan, Xiangyu Shi, Jiawei Sun, Chuan Wu, Yu Li

TL;DR

This work proposes Bio-benchmark, a prompting-based framework for evaluating large language models across 30 bioinformatics tasks spanning sequence and text data. It introduces BioFinder to robustly extract key answers from complex LLM outputs, enabling reliable objective and subjective assessments without fine-tuning. Across six mainstream LLMs and 0-shot to few-shot CoT settings, the study reveals where current models excel (e.g., protein/RNA design and medical QA) and where they struggle (certain DDIs, complex TCM reasoning), while offering prompt engineering strategies to boost performance. The framework and tools advance practical application of LLMs in bioinformatics by providing a rigorous, scalable benchmarking and analysis pipeline with clear guidance for model development and task-specific prompting.

Abstract

Large language models (LLMs) have become important tools in solving biological problems, offering improvements in accuracy and adaptability over conventional methods. Several benchmarks have been proposed to evaluate the performance of these LLMs. However, current benchmarks struggle to evaluate these models effectively across diverse tasks. In this paper, we introduce a comprehensive prompting-based benchmarking framework, termed Bio-benchmark, which includes 30 key bioinformatics tasks covering areas such as proteins, RNA, drugs, electronic health records, and traditional Chinese medicine. Using this benchmark, we evaluate six mainstream LLMs, including GPT-4o and Llama-3.1-70b, under 0-shot and few-shot Chain-of-Thought (CoT) settings without fine-tuning to reveal their intrinsic capabilities. To improve the efficiency of our evaluations, we present BioFinder, a new tool for extracting answers from LLM responses, which increases extraction accuracy by around 30% compared to existing methods. Our benchmark results show which biological tasks are suitable for current LLMs and identify specific areas requiring enhancement. Furthermore, we propose targeted prompt engineering strategies for optimizing LLM performance in these contexts. Based on these findings, we provide recommendations for the development of more robust LLMs tailored for various biological applications. This work offers a comprehensive evaluation framework and robust tools to support the application of LLMs in bioinformatics.
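To make the evaluation loop in the abstract concrete, here is a minimal sketch of prompting-based benchmarking for a multiple-choice task: build a 0-shot or few-shot CoT prompt, extract the answer from the response, and score accuracy. The function names, the prompt layout, and the regular expression are all assumptions made for illustration. In particular, the extractor is a naive pattern matcher of the kind an answer-extraction tool aims to improve on; it is not BioFinder.

```python
import re

def build_prompt(question, examples=()):
    """Assemble a prompt: 0-shot if `examples` is empty, few-shot CoT otherwise.

    Each example is a hypothetical (question, reasoning, answer) triple.
    """
    parts = []
    for ex_question, ex_reasoning, ex_answer in examples:
        parts.append(
            f"Question: {ex_question}\nReasoning: {ex_reasoning}\nAnswer: {ex_answer}\n"
        )
    # The trailing "Reasoning:" cue asks the model to think step by step.
    parts.append(f"Question: {question}\nReasoning:")
    return "\n".join(parts)

# Naive pattern: "answer is (B)" or "Answer: C", capturing a single option letter.
ANSWER_PATTERN = re.compile(r"[Aa]nswer\s*(?:is|:)\s*\(?([A-E])\b\)?")

def extract_choice(response):
    """Return the last option letter stated in a response, or None if absent."""
    matches = ANSWER_PATTERN.findall(response)
    return matches[-1] if matches else None

def accuracy(responses, gold):
    """Fraction of responses whose extracted letter equals the gold label."""
    if not gold:
        return 0.0
    hits = sum(extract_choice(r) == g for r, g in zip(responses, gold))
    return hits / len(gold)
```

A response that never states its answer in the expected form is scored as wrong even when its reasoning is correct, which is why extraction quality directly affects the reported benchmark numbers.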

Paper Structure

This paper contains 44 sections, 6 equations, 38 figures, 2 tables.

Figures (38)

  • Figure 1: Overview of the paper. Bio-benchmark is divided into sequence (Protein, RNA, RBP, Drug) and text (EHR, Medical, TCM) data (left). Once the benchmark is built, six representative LLMs generate answers for a total of 30 subtasks across seven bioinformatics domains under 0-shot and few-shot settings. After extracting the answers generated by the LLMs using our proposed BioFinder, we evaluate and analyze the extracted answers of the different LLMs (right). The experimental results are in Appendix Table \ref{complete}, Figure \ref{bb1}, \ref{bb2}, and \ref{res}.
  • Figure 2: The performance of sequence and drug benchmarks in the Bio-benchmark. a, b represent overall performance. c, d, e indicate performance on protein benchmark. f, g, h show performance on RNA benchmark. i, j reflect performance on RBP benchmark. k, l demonstrate performance on drug benchmark.
  • Figure 3: The performance of biotext benchmarks in the Bio-benchmark. a, b represent performance on EHR benchmark. c, d indicate performance on Medical-QA benchmark. e, f show performance on TCM benchmark.
  • Figure 4: The proportion of Alignment by different methods.
  • Figure 5: The proportion of Alignment by different methods.
  • ...and 33 more figures