Long-range gene expression prediction with token alignment of large language model

Edouardo Honig; Huixin Zhan; Ying Nian Wu; Zijun Frank Zhang

TL;DR

GTA aligns genetic sequence features with natural language tokens so that a frozen language model can reason symbolically over genomic sequence features. The approach learns regulatory grammar and enables in-context learning that is not possible with existing models.

Abstract

Gene expression is a cellular process that plays a fundamental role in human phenotypic variation and disease. Despite advances in deep learning models for gene expression prediction, recent benchmarks have revealed their inability to learn distal regulatory grammar. Here, we address this challenge by leveraging a pretrained large language model to enhance gene expression prediction. We introduce Genetic sequence Token Alignment (GTA), which aligns genetic sequence features with natural language tokens, allowing for symbolic reasoning over genomic sequence features via the frozen language model. This cross-modal adaptation learns the regulatory grammar and allows us to further incorporate gene-specific human annotations as prompts, enabling in-context learning that is not possible with existing models. Trained on lymphoblastoid cells, GTA was evaluated on cells from the Geuvadis consortium and outperforms state-of-the-art models such as Enformer, achieving a Spearman correlation of 0.65, a 10% improvement. Additionally, GTA offers improved interpretation of long-range interactions by identifying the most meaningful sections of the input genetic context. By utilizing a pretrained language model, GTA represents a powerful and novel cross-modal approach to gene expression prediction, a paradigm shift from conventional gene expression models trained only on sequence data.
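The headline result above is reported as a Spearman correlation, which is the Pearson correlation computed on ranks rather than raw values. As a minimal sketch of how such a score could be computed between predicted and measured expression, the following uses only the Python standard library. The function names and the example numbers are hypothetical and are not taken from the GTA evaluation pipeline, which this page does not describe.

```python
from math import sqrt


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        # Extend the run [start, end] over all entries equal to the current value.
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


def spearman(predicted, observed):
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    if len(predicted) != len(observed):
        raise ValueError("inputs must have the same length")
    if len(predicted) < 2:
        raise ValueError("need at least two paired values")
    return pearson(average_ranks(predicted), average_ranks(observed))


# Made-up expression values for five genes (illustration only, not paper data).
predicted_expression = [0.2, 1.5, 0.9, 3.1, 2.2]
observed_expression = [0.1, 1.1, 1.4, 2.9, 2.0]
rho = spearman(predicted_expression, observed_expression)
```

Because the score depends only on rank order, it rewards a model for ordering expression levels correctly even when the predicted magnitudes are off, which is one reason it is a common choice for expression benchmarks.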

Paper Structure (11 sections, 1 equation, 10 figures, 3 tables)

Introduction
Related Work
Methods
Results
Baseline Comparison
Token Alignment
Ablation Study
Limitations
Conclusion
Sei Sequence Classes Names
Token Alignment Attention Heads

Figures (10)

Figure 1: Comparison of input genetic sequence length for gene expression prediction models. GTA enables a flexible range of input lengths, and we train models with input context from 200-1,000 kb.
Figure 2: Overview of GTA.
Figure 3: Detailed view of token alignment procedure.
Figure 4: NCBI Gene Annotation adopted from https://www.ncbi.nlm.nih.gov/gene/1.
Figure 5: GTA predictions on the evaluation data.
...and 5 more figures
