Language modelling techniques for analysing the impact of human genetic variation
Megha Hegde, Jean-Christophe Nebel, Farzana Rahman
TL;DR
The paper surveys language modelling techniques for analysing the impact of human genetic variation, tracing the evolution from pre-Transformer models (CNNs/RNNs) through Transformer-based architectures to post-Transformer approaches. It emphasises input representations (DNA/RNA/protein sequences, mutations, multiple sequence alignments (MSAs)) and the common pre-training/fine-tuning pipeline, while noting data scarcity and the use of cross-species data. The review highlights foundational models such as DNABERT, DNABERT-2, Nucleotide Transformer, and ESM, evaluates performance trends across coding and non-coding tasks, and stresses benchmarking challenges and the need for standardised datasets. It also discusses limitations including computational cost, data bias, and privacy concerns, and points to promising directions in efficient architectures (Hyena/Mamba), zero-shot inference, and targeted benchmarks to accelerate clinically relevant deployment.
Abstract
Interpreting the effects of variants within the human genome and proteome is essential for analysing disease risk, predicting medication response, and developing personalised health interventions. Owing to the intrinsic structural similarities between natural languages and genetic sequences, natural language processing techniques have proved highly applicable to computational variant effect prediction. In particular, the advent of the Transformer has led to significant advancements in the field. However, Transformer-based models have limitations, and a number of extensions and alternatives have been developed to improve results and enhance computational efficiency. This review explores the use of language models for computational variant effect prediction over the past decade, analysing the main architectures and identifying key trends and future directions.
