GENERator: A Long-Context Generative Genomic Foundation Model

Wei Wu, Qiuyi Li, Yuanyuan Zhang, Zhihao Zhan, Ruipu Chen, Mingyang Li, Kun Fu, Junyan Qi, Yongzhou Bao, Chao Wang, Yiheng Zhu, Zhiyun Zhang, Jian Tang, Fuli Feng, Jieping Ye, Yuwen Liu, Hui Xiong, Zheng Wang

TL;DR

GENERator addresses the challenge of interpreting and engineering genomic sequences by introducing a long-context generative genomic foundation model trained on a biologically curated, gene-centric corpus of eukaryotic DNA. It demonstrates strong intrinsic representations and efficient, high-fidelity sequence generation using a 6-mer tokenization within a transformer framework, and it extends to practical tasks including alignment-free variant effect prediction, central dogma-consistent protein design, and prompt-guided CRE design with experimental validation. The work highlights critical design choices—biologically informed data, domain-specific tokenization, and effective long-context integration—for building practical genomic foundation models, and it provides a path toward controllable, experiment-backed genomic design. The open-access resources and modular framework support broad adoption in functional genomics and synthetic biology, illustrating how AI can meaningfully accelerate genome interpretation and design.

Abstract

The rapid advancement of DNA sequencing has produced vast genomic datasets, yet interpreting and engineering genomic function remain fundamental challenges. Recent large language models have opened new avenues for genomic analysis, but existing approaches are often limited by restricted training scope, constrained generative capability, or prohibitive computational cost. We introduce GENERator, a generative genomic foundation model for long-context DNA modeling, with a context length of 98k nucleotides, pre-trained on 386 billion nucleotides of eukaryotic DNA. Without task-specific fine-tuning, GENERator exhibits strong intrinsic capabilities: unsupervised embedding analyses reveal phylogenetically coherent structure, and sequence recovery benchmarks demonstrate generative accuracy comparable to or exceeding state-of-the-art models with substantially improved computational efficiency. In a zero-shot setting, GENERator achieves competitive variant effect prediction performance relative to alignment-based methods, while remaining fully alignment-free and broadly applicable across species. With task-specific fine-tuning, the model attains leading performance on established genomic benchmarks. We further demonstrate practical generative applications. GENERator can generate protein-coding DNA sequences that translate into structurally plausible proteins and, through a prompt-guided design framework, design cis-regulatory elements with targeted activity profiles, including synthetic super-enhancers validated by high-throughput UMI-STARR-seq assays. Together, these results establish GENERator as an efficient and biologically grounded framework for genomic interpretation and programmable sequence design. Code and supplementary resources are available at https://github.com/GenerTeam/GENERator.
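
To make the 6-mer tokenization mentioned above concrete, here is a minimal sketch of how a DNA string could be split into 6-mer tokens and how many such tokens a 98k-nucleotide context would hold. It is an illustration under stated assumptions, not the released tokenizer: the function names `build_vocab` and `tokenize` are hypothetical, the 6-mers are assumed to be non-overlapping, and trailing bases that do not fill a whole 6-mer are simply dropped. Ambiguous bases such as N are not handled.

```python
from itertools import product

K = 6  # k-mer size named in the summary above


def build_vocab(k=K):
    """Map every k-mer over A/C/G/T to an integer id (4**k entries)."""
    return {"".join(p): i for i, p in enumerate(product("ACGT", repeat=k))}


def tokenize(seq, k=K):
    """Split a DNA string into non-overlapping k-mers.

    Assumption: trailing bases that do not fill a whole k-mer are dropped.
    """
    seq = seq.upper()
    usable = len(seq) - len(seq) % k
    return [seq[i:i + k] for i in range(0, usable, k)]


vocab = build_vocab()                      # 4**6 = 4096 possible tokens
tokens = tokenize("ACGTACGTACGTAC")        # 14 bases -> two full 6-mers
ids = [vocab[t] for t in tokens]           # integer ids fed to the model

# Rough token budget for a 98k-nucleotide context under this scheme.
context_tokens = 98_000 // K
```

Under these assumptions each token covers six nucleotides, so a long genomic window shrinks to roughly one sixth as many positions for the transformer to attend over.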

Paper Structure

This paper contains 74 sections, 5 equations, 14 figures, 16 tables.

Figures (14)

  • Figure 1: Overview of the GENERator model. (A) GENERator is pretrained on large-scale eukaryotic genomic sequences from the RefSeq database, spanning all major eukaryotic lineages. We adopt a functional sequence training strategy that leverages RefSeq annotations to extract gene-centric functional regions, and perform pretraining exclusively on these biologically meaningful sequences. (B) GENERator is a generative DNA language model based on a transformer decoder architecture with 6-mer tokenization, pretrained using an autoregressive next-token prediction objective to learn sequence dependencies and regulatory grammars from functional genomic data. (C) GENERator enables both training-free (zero-shot) and task-specific fine-tuning applications. Zero-shot tasks include genomic embedding representation, prompt-conditioned sequence recovery and generation, and variant effect prediction for benign versus pathogenic mutations. Fine-tuned tasks include supervised sequence classification and regression (e.g., promoter classification and enhancer activity prediction), central-dogma tasks involving protein-coding sequence generation and translation, and prompt-responsive cis-regulatory element (CRE) design for generating enhancers with specified activity profiles. (D) Top: phylogenetic tree illustrating the eukaryotic species represented in the RefSeq dataset. Bottom: UMAP visualizations of genome embeddings learned by GENERator, showing taxonomically consistent clustering. Increasing sequence length (16k to 96k) and model size (1B to 3B parameters) leads to progressively improved embedding separation, indicating clear scaling effects in representation quality.
  • Figure 2: Comprehensive benchmarking of GENERator. (A) Sequence recovery accuracy is compared across representative DNA foundation models. GENERator and Evo2 substantially outperform all other baseline models. At matched parameter scales, GENERator-1B consistently exceeds Evo2-1B in overall performance while achieving markedly higher generation efficiency. Scaling up to GENERator-3B yields clear performance gains over the 1B model and approaches the performance of Evo2-7B. (B) Generation time is measured on a single L40S GPU by conditioning on a 6k input sequence and generating the next 50 bp. GENERator generates tens of times faster than Evo2 across model sizes, highlighting its substantially improved computational efficiency. (C) Sequence recovery performance is evaluated under identical model architectures, batch sizes, and training steps across six taxonomic groups. The 6-mer tokenizer achieves the best overall performance, whereas BPE tokenization performs consistently worse, likely due to the hierarchical nature of the BPE vocabulary. (D) Performance on ClinVar variant effect prediction is evaluated using AUPRC and AUROC. GENERator and Evo2 significantly outperform other sequence-based self-supervised models. GENERator-1B consistently surpasses Evo2-1B, while GENERator-3B achieves performance comparable to Evo2-7B. Although MSA-based methods such as GPN-MSA and CADD retain advantages on this task, GENERator and Evo2 operate without MSA information, enabling straightforward application to non-model organisms. (E) Comparison across diverse supervised benchmarks, including Revised NT tasks, Original NT tasks, Genomic Benchmarks, and Gene tasks, shows that GENERator achieves the strongest overall performance across all evaluated settings. (F-G) Central-dogma tasks are evaluated on the cytochrome P450 family (F) and histone family (G). Generated DNA sequences exhibit stable and continuous coding structures. Translated proteins show low perplexity under the protein language model ProGen2, indicating high protein-likeness. Structural validation using AlphaFold3 followed by Foldseek reveals that most generated proteins fall within high-confidence regions. Importantly, many generated proteins display strong structural similarity to natural counterparts despite low sequence identity ($<0.3$), demonstrating that GENERator captures protein grammar rather than memorizing natural sequences.
  • Figure 3: GENERator enables accurate design of CREs with dynamic ranges exceeding those of natural sequences. (A) Correlation between experimentally measured and model-predicted activities of CREs in the DeepSTARR hold-out test set. (B) Schematic overview of the UMI-STARR-seq library construction workflow using developmental (DSCP; Dev) and housekeeping (RpS12; Hk) core promoters. GENERator-designed sequences were synthesized as pooled oligonucleotides, cloned into reporter plasmids, and transfected into S2 cells. CRE activity was quantified as enrichment of UMI counts in RNA output relative to plasmid input. (C) Comparison of experimental activities between GENERator-designed and natural sequences. For GENERator-designed sequences: top 100 from the <high> group, bottom 100 from the <low> group, and 100 randomly selected from the <mid> group. For natural sequences: top 100 highest-activity and bottom 100 lowest-activity sequences from the DeepSTARR dataset. (D) Transcription factor DNA-binding motifs enriched in GENERator-designed sequences of different activity levels. (E) Motif composition of two representative GENERator-designed sequences. Blue lines: insect transcription factor motifs; blue boxes: vertebrate transcription factor motifs; red boxes: unannotated high-contribution regions.
  • Figure 4: Overview of the Gener Project. This figure illustrates the modular architecture of the Gener Project, dividing tasks among four distinct experts. This design enables individual models to be deployed, updated, and scaled independently, simplifying maintenance and reducing resource demands.
  • Figure S1: Detailed model performance across three benchmark suites. GENERator-1B achieves the highest average score across all tasks while demonstrating exceptional performance in individual benchmarks, securing first place in 32 out of 47 tasks and second place in 10 tasks. It consistently outperforms the control model GENERator-All, validating the effectiveness of functional sequence training.
  • ...and 9 more figures
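
The Figure 3 caption describes CRE activity as the enrichment of UMI counts in RNA output relative to plasmid input. The sketch below shows one common way such an enrichment could be computed, as a log2 ratio of library-size-normalized counts. The exact normalization used in the paper is not given here, so the log2 form, the pseudocount, and the function name `cre_activity` are all assumptions made for illustration.

```python
import math


def cre_activity(rna_umis, plasmid_umis, rna_total, plasmid_total, pseudocount=1.0):
    """Log2 enrichment of a sequence's RNA UMIs over its plasmid-input UMIs.

    Assumptions (not taken from the paper): counts are normalized by the
    total UMIs in each library, and a pseudocount avoids division by zero.
    """
    rna_frac = (rna_umis + pseudocount) / rna_total
    plasmid_frac = (plasmid_umis + pseudocount) / plasmid_total
    return math.log2(rna_frac / plasmid_frac)


# A sequence equally represented in both libraries has no enrichment.
neutral = cre_activity(99, 99, rna_total=1000, plasmid_total=1000)

# A sequence twice as abundant in RNA as in the plasmid pool scores higher.
enriched = cre_activity(199, 99, rna_total=1000, plasmid_total=1000)
```

Under this convention positive values indicate a sequence that drives more reporter transcription than its share of the input library would predict, and negative values indicate the opposite.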