Biological Sequence with Language Model Prompting: A Survey

Jiyue Jiang, Zikang Wang, Yuheng Shan, Heyan Chai, Jiayi Li, Zixian Ma, Xinrui Zhang, Yu Li

TL;DR

This survey addresses the problem of applying prompt-based methods with large language models to biological sequences across DNA, RNA, protein, and drug discovery tasks. It surveys how prompts and in-context learning recast domain problems as NLP problems to enable zero-/few-shot learning, with detailed coverage of DNA promoter identification, RNA functional-element analysis, protein structure and interaction tasks, and drug-target predictions, highlighting AlphaFold and ESM as pivotal technologies. The work identifies data scarcity, multimodal data fusion challenges, and computational resource demands as key bottlenecks, and proposes future directions in data-centric annotation, unified multimodal prompting, and efficient prompting techniques to advance AI-enabled bioinformatics. Overall, the paper serves as a foundational primer and roadmap for leveraging prompt engineering to accelerate biological sequence analysis and drug discovery.

Abstract

Large language models (LLMs) have emerged as powerful tools for addressing challenges across diverse domains. Notably, recent studies have demonstrated that LLMs significantly enhance the efficiency of biomolecular analysis and synthesis, attracting widespread attention from academia and medicine. In this paper, we systematically investigate the application of prompt-based methods with LLMs to biological sequences, including DNA, RNA, proteins, and drug discovery tasks. Specifically, we focus on how prompt engineering enables LLMs to tackle domain-specific problems, such as promoter sequence prediction, protein structure modeling, and drug-target binding affinity prediction, often with limited labeled data. Furthermore, our discussion highlights the transformative potential of prompting in bioinformatics while addressing key challenges such as data scarcity, multimodal fusion, and computational resource limitations. Our aim is for this paper to function both as a foundational primer for newcomers and as a catalyst for continued innovation within this dynamic field of study.
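To make the idea of recasting a sequence task as a text task concrete, the following is a minimal sketch of how a few-shot prompt for promoter identification might be assembled. It is not taken from the paper: the function name `build_few_shot_prompt`, the prompt wording, and the toy sequences and labels are all illustrative assumptions, and the sketch stops at building the prompt string rather than calling any particular model.

```python
# Hypothetical helper: builds a few-shot prompt for promoter identification.
# The template wording and the example sequences are illustrative assumptions,
# not the format used by any specific paper or model.

def build_few_shot_prompt(examples, query):
    """Assemble a text prompt from (sequence, label) pairs plus one query sequence."""
    lines = [
        "Task: decide whether each DNA sequence is a promoter. Answer yes or no.",
        "",
    ]
    # Each labeled example becomes an in-context demonstration.
    for sequence, label in examples:
        lines.append(f"Sequence: {sequence}")
        lines.append(f"Promoter: {label}")
        lines.append("")
    # The query is left unanswered so the model completes the label.
    lines.append(f"Sequence: {query}")
    lines.append("Promoter:")
    return "\n".join(lines)


# Toy demonstrations; the labels are placeholders, not curated annotations.
examples = [
    ("TATAAAAGGCGC", "yes"),
    ("GCGCGCGCGCGC", "no"),
]
prompt = build_few_shot_prompt(examples, "TTGACATATAAT")
print(prompt)
```

Passing an empty `examples` list yields the zero-shot variant of the same prompt, which is one way to compare the two settings under an otherwise identical template.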

Paper Structure

This paper contains 33 sections, 33 equations, and 5 figures.

Figures (5)

  • Figure 1: Biological sequence with language model prompting, and an RNA prompting case.
  • Figure 2: Timeline of key advancements in LLMs for computational biology.
  • Figure 3: Specific examples of the use of prompt methods for DNA, RNA, protein, and drug tasks.
  • Figure 4: Case study on prompt-based generated DNA sequences related to Landau-Kleffner Syndrome (LKS). (a) DeepSeek-R1 (RMSD = 21.87 Å, TM-score = 0.20958) (b) GPT-4o (RMSD = 33.72 Å, TM-score = 0.14644) (c) Llama-3.3-70b (RMSD = 23.73 Å, TM-score = 0.14224) (d) Qwen-2.5-max (RMSD = 9.58 Å, TM-score = 0.44619).
  • Figure 5: Literature taxonomy of LLMs in computational biology.