Language modelling techniques for analysing the impact of human genetic variation

Megha Hegde, Jean-Christophe Nebel, Farzana Rahman

TL;DR

The paper surveys language modelling techniques applied to analysing the impact of human genetic variation, tracing the evolution from pre-Transformer models (CNNs/RNNs) through Transformer-based architectures to post-Transformer approaches. It emphasises input representations (DNA/RNA/protein sequences, mutations, MSAs) and the common pre-training/fine-tuning pipeline, while noting data scarcity and cross-species data usage. The review highlights foundational models such as DNABERT, DNABERT-2, Nucleotide Transformer, and ESM, evaluates performance trends across coding and non-coding tasks, and stresses benchmarking challenges and the need for standardised datasets. It also discusses limitations including computational cost, data bias, and privacy concerns, and points to promising directions in efficient architectures (Hyena/Mamba), zero-shot inference, and targeted benchmarks to accelerate clinically relevant deployment.

Abstract

Interpreting the effects of variants within the human genome and proteome is essential for analysing disease risk, predicting medication response, and developing personalised health interventions. Due to the intrinsic similarities between the structure of natural languages and genetic sequences, natural language processing techniques have demonstrated great applicability in computational variant effect prediction. In particular, the advent of the Transformer has led to significant advancements in the field. However, Transformer-based models are not without their limitations, and a number of extensions and alternatives have been developed to improve results and enhance computational efficiency. This review explores the use of language models for computational variant effect prediction over the past decade, analysing the main architectures, and identifying key trends and future directions.

Paper Structure

This paper contains 20 sections, 1 equation, 9 figures, 10 tables.

Figures (9)

  • Figure 1: Illustration of coding vs non-coding DNA, and a SNP in a promoter region, for a eukaryotic cell. Non-coding DNA includes cis-regulatory elements (CREs), such as promoters and transcription factor binding sites. Promoters drive the initiation of transcription [andersson2020determinants]. Other CREs include enhancers and silencers, which positively and negatively regulate gene expression, respectively. Insulators are an additional type of CRE, which interact with nearby CREs and can block distal enhancers, or regulate chromatin interactions [west2002insulators]. Created in BioRender. Hegde, M. (2024) https://BioRender.com/e16b233. Alt text: Illustration of DNA, with coding region (gene) and key non-coding regions (promoter, insulator, transcription factor binding sites) highlighted and labelled, and a visual representation of a SNP.
  • Figure 2: Generic language modelling pipeline, including the main categories of tasks covered in this review. The DNA, RNA, or protein sequences are tokenised before being input to the model. The model is initially pre-trained on a large corpus of data, and then fine-tuned on a dataset specific to the planned downstream tasks, for example, variant pathogenicity classification. Icons from BioRender https://app.biorender.com/. Alt text: Flowchart showing the language modelling pipeline from inputs to outputs.
  • Figure 3: Timeline of models from 1980 until the development of the Transformer. Classical ML refers to classical machine learning techniques such as support vector machine and Naive Bayes. FFNN = feed-forward neural network; CNN = convolutional neural network; LSTM = long short-term memory. Markov models are often used to construct grammars [galley2007lexicalized; zhu2008unsupervised]. Alt text: Timeline showing the year of emergence of different language modelling techniques, from 1980 to the development of the Transformer in 2017.
  • Figure 4: Timeline of developments in NLP since 2017. Alt text: Radial timeline showing the years in which impactful language modelling technologies were developed, starting with the Transformer in 2017.
  • Figure 5: Analysis of the number of published papers, and the number of annual citations for the highest-impact papers. (a) Number of papers published per year on language models for variant effect prediction, as described in Tables \ref{tab:neural}, \ref{tab:transformer_models}, and \ref{tab:post_transformer_models}. Neural LM = neural language models (Table \ref{tab:neural}). LLM refers to both Transformer-based and post-Transformer models (Tables \ref{tab:transformer_models} and \ref{tab:post_transformer_models}). During the period 2018 to 2024, the overall number of papers per year has generally increased, with a slight decrease from 2023 to 2024. The number of LLM papers has far exceeded the number of neural LM papers each year. (b) Number of citations per year for the most impactful papers. The number of citations per year for these papers has steadily increased since their publication. Alt text: Graphs on paper publication and citation data with sub-figures labelled a and b.
  • ...and 4 more figures
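
The tokenisation step described in the Figure 2 caption can be illustrated with a minimal sketch. It assumes overlapping k-mer tokenisation with stride 1, which is only one of several schemes used for nucleotide sequences (byte-pair encoding is another), and it is not taken from the paper. The function names `kmer_tokenise` and `apply_snp` and the toy sequence are hypothetical, chosen for illustration only.

```python
def kmer_tokenise(sequence: str, k: int = 3) -> list[str]:
    """Split a nucleotide sequence into overlapping k-mers (stride 1)."""
    if k <= 0:
        raise ValueError("k must be a positive integer")
    sequence = sequence.upper()
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def apply_snp(sequence: str, position: int, alt: str) -> str:
    """Return a copy of the sequence with the base at a 0-based position replaced."""
    if not 0 <= position < len(sequence):
        raise IndexError("SNP position is outside the sequence")
    return sequence[:position] + alt + sequence[position + 1:]


# Toy reference sequence (hypothetical, not real genomic data).
reference = "ACGTAC"
variant = apply_snp(reference, position=2, alt="T")  # G -> T at index 2

ref_tokens = kmer_tokenise(reference, k=3)
var_tokens = kmer_tokenise(variant, k=3)

# Indices of the tokens that differ between reference and variant.
changed = [i for i, (r, v) in enumerate(zip(ref_tokens, var_tokens)) if r != v]

print(ref_tokens)
print(var_tokens)
print(changed)
```

With overlapping k-mers, a single substitution can alter up to k adjacent tokens, so a model sees the variant in several local contexts at once. That is one reason the choice of input representation matters for variant effect prediction.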