GeneZip: Region-Aware Compression for Long Context DNA Modeling

Jianan Zhao, Xixian Liu, Zhihao Zhan, Xinyu Yuan, Hongyu Guo, Jian Tang

TL;DR

GeneZip is a DNA compression model that leverages a key biological prior, namely that genomic information is highly imbalanced, to adaptively allocate representation budget across genomic regions.

Abstract

Genomic sequences span billions of base pairs (bp), posing a fundamental challenge for genome-scale foundation models. Existing approaches largely sidestep this barrier by either scaling relatively small models to long contexts or relying on heavy multi-GPU parallelism. Here we introduce GeneZip, a DNA compression model that leverages a key biological prior: genomic information is highly imbalanced. Coding regions comprise only a small fraction (about 2 percent) yet are information-dense, whereas most non-coding sequence is comparatively information-sparse. GeneZip couples HNet-style dynamic routing with a region-aware compression-ratio objective, enabling adaptive allocation of representation budget across genomic regions. As a result, GeneZip learns region-aware compression and achieves 137.6x compression with only 0.31 perplexity increase. On downstream long-context benchmarks, GeneZip achieves comparable or better performance on contact map prediction, expression quantitative trait loci prediction, and enhancer-target gene prediction. By reducing effective sequence length, GeneZip unlocks simultaneous scaling of context and capacity: compared to the prior state-of-the-art model JanusDNA, it enables training models 82.6x larger at 1M-bp context, supporting a 636M-parameter GeneZip model at 1M-bp context. All experiments in this paper can be trained on a single A100 80GB GPU.
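To make the scale of these numbers concrete, the short sketch below works through the arithmetic implied by the abstract. It is a back-of-the-envelope illustration, not something taken from the paper: the function names are hypothetical, the 137.6x figure is assumed to apply uniformly as an average ratio, and "1M bp" is evaluated both as 10^6 and as 2^20 because the abstract does not say which is meant. The baseline parameter count is likewise only inferred by dividing 636M by 82.6.

```python
def compressed_length(num_bases: int, compression_ratio: float) -> int:
    """Approximate latent token count after compression (hypothetical helper)."""
    return round(num_bases / compression_ratio)


def implied_baseline_params(model_params: float, scale_factor: float) -> float:
    """Parameter count implied for a baseline that is `scale_factor` times smaller."""
    return model_params / scale_factor


# Assumption: the reported 137.6x average ratio holds at 1M-bp context.
tokens_at_1m = compressed_length(1_000_000, 137.6)   # 1M read as 10^6
tokens_at_2_20 = compressed_length(2**20, 137.6)     # 1M read as 2^20

# Assumption: "82.6x larger" is measured against the 636M-parameter model.
baseline_params = implied_baseline_params(636e6, 82.6)

print(tokens_at_1m)                       # 7267
print(tokens_at_2_20)                     # 7620
print(f"{baseline_params / 1e6:.1f}M")    # 7.7M
```

Under these assumptions, a 1M-bp input shrinks to roughly 7.3K to 7.6K latent tokens, which is the sequence length the downstream backbone actually has to process.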

Paper Structure

This paper contains 32 sections, 16 equations, 4 figures, and 7 tables.

Figures (4)

  • Figure 1: GeneZip overview. GeneZip compresses ultra-long DNA into a much shorter latent token sequence via hierarchical dynamic routing. During training, GeneZip uses Region-Aware Ratio (RAR) loss to learn a region-adaptive token compression budget, while bounded routing enforces global token-count constraints for stable training. The compressed tokens are then processed by an arbitrary token-mixing backbone for downstream long-range tasks. Notably, once trained, GeneZip can compress unseen raw DNA sequences without requiring annotations at inference time.
  • Figure 2: Why bounded routing is necessary. (a) In stage-1 compression, the router can be highly unstable early on and occasionally selects an excessively large number of tokens, which risks memory blow-up; a ceiling constraint prevents such pathological spikes. (b) In stage-2 compression, the router may collapse to selecting too few tokens (often near 1) at the beginning, which can induce training spikes and slow down optimization; a floor constraint enforces a minimum token budget for stable learning. (c) The floor constraint yields smoother and more favorable perplexity trajectories compared to training without it.
  • Figure 3: Inference efficiency on ultra-long inputs. End-to-end inference latency as a function of input sequence length (256 bp to 1M bp). GeneZip (70M/636M) remains consistently fast across the full range, while baseline long-context models exhibit sharply increasing latency as sequence length grows. The inset zooms into the short-to-mid regime (1K–131K) to highlight differences at smaller contexts.
  • Figure 4: Case studies of region-adaptive compression. Region labels are induced from GENCODE gene models (Frankish et al., 2019), covering promoter, CDS, UTR, exon, intron, and near-intergenic (NIG) regions, and the curves plot per-base boundary probability for H-Net (Hwang et al., 2025) and GeneZip. We highlight two loci on chr9: a promoter-dense transcript locus (chr9:136,980,237–137,030,237; top) and an intron-dominated gene locus (chr9:37,530,000–37,580,000; bottom). Across both loci, GeneZip concentrates boundary mass in promoters and transcribed features, while suppressing diffuse boundaries in long intronic or intergenic spans.
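The floor and ceiling constraints described for Figure 2, and the per-region boundary mass discussed for Figure 4, can be illustrated with a small sketch. This is a minimal illustration under stated assumptions, not the authors' implementation: the captions do not specify how bounded routing is realized, so the code simply thresholds per-base boundary probabilities and then clamps the number of selected positions to a `[floor, ceiling]` range by keeping the highest-probability positions. The function names, the 0.5 threshold, and the toy probabilities and labels are all hypothetical.

```python
from collections import defaultdict


def bounded_select(probs, floor, ceiling, threshold=0.5):
    """Select boundary positions, clamping their count to [floor, ceiling].

    Hypothetical stand-in for bounded routing: threshold first, then keep
    the top-k positions where k is the clamped count.
    """
    if not 0 <= floor <= ceiling:
        raise ValueError("need 0 <= floor <= ceiling")
    n = len(probs)
    floor, ceiling = min(floor, n), min(ceiling, n)
    # Rank positions from most to least likely boundary (ties broken by index).
    ranked = sorted(range(n), key=lambda i: (-probs[i], i))
    # Unconstrained count: positions whose probability reaches the threshold.
    k = sum(p >= threshold for p in probs)
    # The ceiling caps early spikes; the floor prevents collapse to ~1 token.
    k = max(floor, min(k, ceiling))
    return sorted(ranked[:k])


def mean_boundary_prob_by_region(probs, labels):
    """Average boundary probability per region label (e.g. CDS, intron)."""
    totals, counts = defaultdict(float), defaultdict(int)
    for p, label in zip(probs, labels):
        totals[label] += p
        counts[label] += 1
    return {label: totals[label] / counts[label] for label in totals}


# Toy data (made up for illustration only).
probs = [0.9, 0.8, 0.1, 0.05, 0.7, 0.02, 0.01, 0.6]
labels = ["CDS", "CDS", "intron", "intron",
          "promoter", "intron", "intron", "promoter"]

# Four positions pass the threshold, but the ceiling keeps only the top three.
print(bounded_select(probs, floor=1, ceiling=3))        # [0, 1, 4]

# A collapsed router selects nothing; the floor forces two tokens anyway.
print(bounded_select([0.1] * 8, floor=2, ceiling=4))    # [0, 1]

region_means = mean_boundary_prob_by_region(probs, labels)
print(region_means)
```

In this toy example the mean boundary probability is much higher in the CDS and promoter positions than in the intron positions, which is the qualitative pattern the Figure 4 caption attributes to GeneZip.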