Deciphering genomic codes using advanced NLP techniques: a scoping review

Shuyan Cheng, Yishu Wei, Yiliang Zhou, Zihan Xu, Drew N Wright, Jinze Liu, Yifan Peng

TL;DR

This scoping review surveys the integration of NLP and large language models with genomic sequencing data, focusing on tokenization, transformer architectures, and predictive annotation of regulatory elements. It synthesizes 26 studies published between 2021 and April 2024 and reports that k-mer tokenization and transformer-based models support prediction of methylation, transcriptional regulation, and RNA interactions, while also highlighting limits on data accessibility and computational resources. The findings underscore the potential of NLP-driven approaches to streamline large-scale genomic analyses and support personalized medicine, and they call for improved interpretability, standardized pipelines, and multimodal data integration. Overall, the work maps current capabilities, gaps, and opportunities for applying NLP and LLMs in genomics and biomedicine.

Abstract

Objectives: The vast and complex nature of human genomic sequencing data presents challenges for effective analysis. This review aims to investigate the application of Natural Language Processing (NLP) techniques, particularly Large Language Models (LLMs) and transformer architectures, in deciphering genomic codes, focusing on tokenization, transformer models, and regulatory annotation prediction. The goal of this review is to assess data and model accessibility in the most recent literature, gaining a better understanding of the existing capabilities and constraints of these tools in processing genomic sequencing data.

Methods: Following Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines, our scoping review was conducted across PubMed, Medline, Scopus, Web of Science, Embase, and ACM Digital Library. Studies were included if they focused on NLP methodologies applied to genomic sequencing data analysis, without restrictions on publication date or article type.

Results: A total of 26 studies published between 2021 and April 2024 were selected for review. The review highlights that tokenization and transformer models enhance the processing and understanding of genomic data, with applications in predicting regulatory annotations like transcription-factor binding sites and chromatin accessibility.

Discussion: The application of NLP and LLMs to genomic sequencing data interpretation is a promising field that can help streamline the processing of large-scale genomic data while also providing a better understanding of its complex structures. It has the potential to drive advancements in personalized medicine by offering more efficient and scalable solutions for genomic analysis. Further research is also needed to discuss and overcome current limitations, enhancing model transparency and applicability.
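
As an illustration of the k-mer tokenization named in the TL;DR, the sketch below splits a DNA string into fixed-length substrings, the step that lets a sequence be treated like a sentence of "words" before it reaches a transformer. It is a minimal example under stated assumptions, not the method of any reviewed study: the function names `kmer_tokenize` and `build_vocab`, the default `k=6`, and the `stride` parameter are all hypothetical choices made here for illustration.

```python
from typing import Dict, List


def kmer_tokenize(sequence: str, k: int = 6, stride: int = 1) -> List[str]:
    """Split a nucleotide sequence into k-mers (hypothetical helper).

    stride=1 yields overlapping k-mers; stride=k yields non-overlapping ones.
    A sequence shorter than k yields an empty list.
    """
    if k <= 0 or stride <= 0:
        raise ValueError("k and stride must be positive integers")
    seq = sequence.upper()
    # Slide a window of width k across the sequence, stepping by `stride`.
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, stride)]


def build_vocab(tokens: List[str]) -> Dict[str, int]:
    """Assign an integer id to each distinct k-mer, in order of first appearance."""
    vocab: Dict[str, int] = {}
    for tok in tokens:
        if tok not in vocab:
            vocab[tok] = len(vocab)
    return vocab


if __name__ == "__main__":
    tokens = kmer_tokenize("ATGCGT", k=3)
    vocab = build_vocab(tokens)
    print(tokens)                      # overlapping 3-mers
    print([vocab[t] for t in tokens])  # integer ids a model could consume
```

Overlapping k-mers (stride 1) keep every position in context but produce roughly one token per base, while a stride equal to `k` shortens the input at the cost of making the tokens depend on where the window starts.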

Paper Structure

This paper contains 24 sections, 1 figure, 1 table.

Figures (1)

  • Figure 1: Flowchart of the literature review process according to PRISMA guidelines.