Genomic Language Models: Opportunities and Challenges

Gonzalo Benegas; Chengzhong Ye; Carlos Albors; Jianan Canal Li; Yun S. Song

TL;DR

This work discusses major considerations for developing and evaluating Genomic Language Models (gLMs), and highlights key applications of gLMs, including functional constraint prediction, sequence design, and transfer learning.

Abstract

Large language models (LLMs) are having transformative impacts across a wide range of scientific fields, particularly in the biomedical sciences. Just as the goal of Natural Language Processing is to understand sequences of words, a major objective in biology is to understand biological sequences. Genomic Language Models (gLMs), which are LLMs trained on DNA sequences, have the potential to significantly advance our understanding of genomes and how DNA elements at various scales interact to give rise to complex functions. To showcase this potential, we highlight key applications of gLMs, including functional constraint prediction, sequence design, and transfer learning. Despite notable recent progress, however, developing effective and efficient gLMs presents numerous challenges, especially for species with large, complex genomes. Here, we discuss major considerations for developing and evaluating gLMs.

Paper Structure (14 sections, 3 figures, 1 table)

INTRODUCTION
APPLICATIONS
DEVELOPMENT
CONCLUDING REMARKS & FUTURE PERSPECTIVES

Figures (3)

Figure 1: Training and applications of gLMs. The schematic on the left-hand side illustrates gLM training. The log-likelihood ratio (LLR) between two alleles $a$ and $b$ at position $i$, specifically $\log[\mathbb{P}(X_i = a \mid X_{-i})/\mathbb{P}(X_i = b \mid X_{-i})]$, is a good unsupervised predictor of functional constraint (\ref{sec:functional-constraint}). New sequences can be generated by sampling from the learned probability distribution (\ref{sec:generation}). A vector representation of each token in the input sequence, called an embedding, can be extracted and adapted for different downstream tasks (\ref{sec:transfer-learning}).
Figure 2: Application examples. (a) gLM-predicted logo plot (top) at a promoter, highlighting a motif (bottom logo) that matches a putative functional TFBS. (b) Correlation between variant minor allele frequency (MAF) and gLM score (log-likelihood ratio). (c) A gLM can be prompted with different control tags to design promoter sequences driving high or low expression in a given cell type. (d) Visualization of gLM embeddings for different classes of genomic windows, illustrating that the learned representations contain useful information such as gene regions. Note: Panels a, b, and d were generated using the GPN model.
Figure 3: Development Pipeline. This figure illustrates the general gLM development pipeline described in this review, from model conception to deployment. We begin with the selection and preparation of the training dataset, emphasizing the importance of data quality and quantity (\ref{sec:learning-data}). Subsequently, in \ref{sec:architecture} and \ref{sec:learning}, we explore the various choices for designing and training gLMs, discussing the strengths and weaknesses of different approaches. We also examine how hybrid models combine elements from multiple architectures to mitigate specific limitations. In \ref{sec:Interpretation}, we discuss methods for analyzing and interpreting the outputs of gLMs. Finally, in \ref{sec:evaluation}, we present evaluation methods through current benchmarks, emphasizing the complexities in aligning model performance with actual biological functions.
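
The log-likelihood ratio in the Figure 1 caption can be illustrated with a minimal Python sketch. It assumes a gLM has already produced a probability for each nucleotide at a masked position $i$ given the surrounding context $X_{-i}$. The function name `log_likelihood_ratio`, the dictionary layout, and the probability values are hypothetical and chosen only for illustration; they do not come from any specific model or library. The choice to treat $a$ as the alternate allele and $b$ as the reference allele is likewise an assumption of this sketch.

```python
import math


def log_likelihood_ratio(probs, a, b):
    """Return log[P(X_i = a | X_{-i}) / P(X_i = b | X_{-i})].

    `probs` maps each nucleotide to the model's predicted probability
    at one masked position (a hypothetical data layout for this sketch).
    """
    p_a = probs[a]
    p_b = probs[b]
    if p_a <= 0 or p_b <= 0:
        raise ValueError("probabilities must be strictly positive")
    # Subtracting logs is equivalent to taking the log of the ratio.
    return math.log(p_a) - math.log(p_b)


# Made-up probabilities at one position; a real gLM would supply these.
example_probs = {"A": 0.85, "C": 0.05, "G": 0.05, "T": 0.05}

# Assumption: "A" is the reference allele and "C" the alternate allele.
llr_alt_vs_ref = log_likelihood_ratio(example_probs, a="C", b="A")
print(f"LLR(C vs. A) = {llr_alt_vs_ref:.3f}")
```

Under this convention, a strongly negative value means the model considers the alternate allele much less likely than the reference in that context, which is the kind of signal the caption describes as an unsupervised predictor of functional constraint. Swapping `a` and `b` flips the sign.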
