GeneZip: Region-Aware Compression for Long Context DNA Modeling

Jianan Zhao; Xixian Liu; Zhihao Zhan; Xinyu Yuan; Hongyu Guo; Jian Tang

TL;DR

GeneZip is a DNA compression model that leverages a key biological prior: genomic information is highly imbalanced. This prior enables adaptive allocation of representation budget across genomic regions.

Abstract

Genomic sequences span billions of base pairs (bp), posing a fundamental challenge for genome-scale foundation models. Existing approaches largely sidestep this barrier by either scaling relatively small models to long contexts or relying on heavy multi-GPU parallelism. Here we introduce GeneZip, a DNA compression model that leverages a key biological prior: genomic information is highly imbalanced. Coding regions comprise only a small fraction (about 2 percent) yet are information-dense, whereas most non-coding sequence is comparatively information-sparse. GeneZip couples HNet-style dynamic routing with a region-aware compression-ratio objective, enabling adaptive allocation of representation budget across genomic regions. As a result, GeneZip learns region-aware compression and achieves 137.6x compression with only 0.31 perplexity increase. On downstream long-context benchmarks, GeneZip achieves comparable or better performance on contact map prediction, expression quantitative trait loci prediction, and enhancer-target gene prediction. By reducing effective sequence length, GeneZip unlocks simultaneous scaling of context and capacity: compared to the prior state-of-the-art model JanusDNA, it enables training models 82.6x larger at 1M-bp context, supporting a 636M-parameter GeneZip model at 1M-bp context. All experiments in this paper can be trained on a single A100 80GB GPU.
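To make the scale of these numbers concrete, the short calculation below works through the figures quoted in the abstract. It is a back-of-the-envelope sketch, not a result from the paper: it assumes the reported 137.6x ratio applies uniformly to a 1M-bp window, and the baseline parameter count is only what the "82.6x larger" claim implies, not a number stated in the text.

```python
# Back-of-the-envelope arithmetic using only figures quoted in the abstract.
# Assumption: the 137.6x ratio is applied uniformly to a 1M-bp window.
context_bp = 1_000_000       # 1M-bp context
compression_ratio = 137.6    # reported compression
coding_fraction = 0.02       # "about 2 percent" of sequence is coding

latent_tokens = context_bp / compression_ratio
coding_bp = context_bp * coding_fraction

genezip_params = 636e6       # 636M-parameter GeneZip model
scale_factor = 82.6          # "82.6x larger" than the prior model
# Implied by the two numbers above; not stated directly in the abstract.
implied_baseline_params = genezip_params / scale_factor

print(f"latent tokens for 1M bp: {latent_tokens:,.0f}")
print(f"coding bases in 1M bp:   {coding_bp:,.0f}")
print(f"implied baseline size:   {implied_baseline_params / 1e6:.1f}M parameters")
```

Under these assumptions a 1M-bp input shrinks to roughly 7.3K latent tokens, which is the sequence length the downstream backbone actually has to process.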

Paper Structure (32 sections, 16 equations, 4 figures, 7 tables)

This paper contains 32 sections, 16 equations, 4 figures, 7 tables.

Introduction
Related Work
Methodology
Problem setup and motivation
GeneZip encoder: hierarchical genomic compression with region-aware supervision
Bounded Routing for Stable Training
Training and Inference
Experiments
Pretraining data and schedule
Pretraining evaluation (perplexity and budget control)
Case studies of GeneZip compression
Contact map prediction
eQTL prediction
Conclusion
Implementation Details
...and 17 more sections

Figures (4)

Figure 1: GeneZip overview. GeneZip compresses ultra-long DNA into a much shorter latent token sequence via hierarchical dynamic routing. During training, GeneZip uses Region-Aware Ratio (RAR) loss to learn a region-adaptive token compression budget, while bounded routing enforces global token-count constraints for stable training. The compressed tokens are then processed by an arbitrary token-mixing backbone for downstream long-range tasks. Notably, once trained, GeneZip can compress unseen raw DNA sequences without requiring annotations at inference time.
Figure 2: Why bounded routing is necessary. (a) In stage-1 compression, the router can be highly unstable early on and occasionally selects an excessively large number of tokens, which risks memory blow-up; a ceiling constraint prevents such pathological spikes. (b) In stage-2 compression, the router may collapse to selecting too few tokens (often near 1) at the beginning, which can induce training spikes and slow down optimization; a floor constraint enforces a minimum token budget for stable learning. (c) The floor constraint yields smoother and more favorable perplexity trajectories compared to training without it.
Figure 3: Inference efficiency on ultra-long inputs. End-to-end inference latency as a function of input sequence length (256 bp to 1M bp). GeneZip (70M/636M) remains consistently fast across the full range, while baseline long-context models exhibit sharply increasing latency as sequence length grows. The inset zooms into the short-to-mid regime (1K to 131K) to highlight differences at smaller contexts.
Figure 4: Case studies of region-adaptive compression. Region labels are induced from GENCODE gene models (Frankish et al., 2019) (promoter/CDS/UTR/exon/intron/near-intergenic (NIG)), and the curves plot per-base boundary probability for H-Net (Hwang et al., 2025) and GeneZip. We highlight two loci on chr9: a promoter-dense transcript locus (chr9:136,980,237-137,030,237; top) and an intron-dominated gene locus (chr9:37,530,000-37,580,000; bottom). Across both loci, GeneZip concentrates boundary mass in promoters and transcribed features, while suppressing diffuse boundaries in long intronic or intergenic spans.
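The floor and ceiling constraints described in the Figure 2 caption can be illustrated with a small sketch. The excerpt here does not specify how GeneZip implements bounded routing, so the code below is only one plausible reading: boundaries are chosen by thresholding per-base probabilities, and the number kept is clamped to a `[floor, ceiling]` range by falling back to a top-k selection. The function name `bounded_select`, the 0.5 threshold, and the tie-breaking rule are all hypothetical.

```python
def bounded_select(probs, floor, ceiling, threshold=0.5):
    """Choose boundary positions, clamping how many are kept to [floor, ceiling].

    Hypothetical sketch: not the routing rule used in the paper.
    """
    if not 0 <= floor <= ceiling <= len(probs):
        raise ValueError("need 0 <= floor <= ceiling <= len(probs)")
    # Rank positions by descending probability; break ties by position.
    ranked = sorted(range(len(probs)), key=lambda i: (-probs[i], i))
    # Number of positions the unconstrained router would select.
    n_selected = sum(p >= threshold for p in probs)
    # Clamp the count: the ceiling caps spikes, the floor prevents collapse.
    k = min(max(n_selected, floor), ceiling)
    return sorted(ranked[:k])


probs = [0.9, 0.1, 0.8, 0.7, 0.2, 0.6]
# Unconstrained thresholding would keep 4 positions; the ceiling caps it at 2.
capped = bounded_select(probs, floor=1, ceiling=2)
# A collapsed router (all probabilities near 0) still yields `floor` tokens.
rescued = bounded_select([0.01, 0.01, 0.01, 0.01], floor=2, ceiling=4)
print(capped, rescued)
```

When the thresholded count already lies within the bounds, this sketch returns exactly the thresholded set, so the constraints only act in the pathological cases the caption describes: too many tokens early in stage 1, or too few in stage 2.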
