BEND: Benchmarking DNA Language Models on biologically meaningful tasks

Frederikke Isa Marin; Felix Teufel; Marc Horlacher; Dennis Madsen; Dennis Pultz; Ole Winther; Wouter Boomsma

TL;DR

BEND provides a standardized, genome-scale benchmark for DNA language models by assembling seven biologically meaningful downstream tasks in the human genome. It assesses a wide range of pre-trained LMs with a lightweight downstream CNN on frozen embeddings, highlighting both the promise of LM representations and their current limitations in long-range genomic reasoning. The study finds that while some models (notably NT-MS) can rival expert baselines on certain tasks, no LM consistently outperforms specialized methods across all tasks, and long-range context remains a major challenge for genome-scale predictions. The findings emphasize that tokenization, pre-training data, and context length critically shape what DNA LMs can learn, motivating further work on long-range modeling and cross-task transfer in genome annotation pipelines.
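The evaluation setup described above (a lightweight CNN on top of frozen embeddings) can be illustrated with a minimal forward-pass sketch. This is not the authors' architecture. The `frozen_embed` function is a hypothetical stand-in that uses one-hot vectors so the example runs without external dependencies, whereas a real DNA LM would return contextual embeddings. The kernel weights are random rather than trained, and all function names are assumptions made for illustration.

```python
import math
import random

# Hypothetical stand-in for a frozen DNA LM: each nucleotide maps to a fixed
# vector. One-hot is used only to keep the sketch dependency-free.
ONE_HOT = {
    "A": [1.0, 0.0, 0.0, 0.0],
    "C": [0.0, 1.0, 0.0, 0.0],
    "G": [0.0, 0.0, 1.0, 0.0],
    "T": [0.0, 0.0, 0.0, 1.0],
}


def frozen_embed(seq):
    """Return one embedding vector per nucleotide; these are never updated."""
    return [ONE_HOT[base] for base in seq]


def conv1d_same(embeddings, kernel, bias):
    """1D convolution over positions with zero padding (output length = input length).

    embeddings: list of L vectors of size D
    kernel:     list of K weight vectors of size D (K odd)
    Returns a list of L scalar activations.
    """
    length, width = len(embeddings), len(kernel)
    half = width // 2
    out = []
    for pos in range(length):
        total = bias
        for k in range(width):
            src = pos + k - half
            if 0 <= src < length:  # positions outside the sequence act as zeros
                total += sum(w * x for w, x in zip(kernel[k], embeddings[src]))
        out.append(total)
    return out


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def predict_per_nucleotide(seq, kernel, bias):
    """Frozen embeddings -> one conv filter -> per-position probability."""
    activations = conv1d_same(frozen_embed(seq), kernel, bias)
    return [sigmoid(a) for a in activations]


# Only the kernel and bias would be trained; the embedder stays fixed.
rng = random.Random(0)
kernel = [[rng.uniform(-1, 1) for _ in range(4)] for _ in range(3)]  # K=3, D=4
probs = predict_per_nucleotide("ACGTAC", kernel, bias=0.0)
print(len(probs))
```

In this setup only the downstream parameters (`kernel`, `bias`) would be fitted to a task, so differences in task performance reflect what the fixed embeddings already encode.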

Abstract

The genome sequence contains the blueprint for governing cellular processes. While the availability of genomes has vastly increased over the last decades, experimental annotation of the various functional, non-coding and regulatory elements encoded in the DNA sequence remains both expensive and challenging. This has sparked interest in unsupervised language modeling of genomic DNA, a paradigm that has seen great success for protein sequence data. Although various DNA language models have been proposed, evaluation tasks often differ between individual works, and might not fully recapitulate the fundamental challenges of genome annotation, including the length, scale and sparsity of the data. In this study, we introduce BEND, a Benchmark for DNA language models, featuring a collection of realistic and biologically meaningful downstream tasks defined on the human genome. We find that embeddings from current DNA LMs can approach performance of expert methods on some tasks, but only capture limited information about long-range features. BEND is available at https://github.com/frederikkemarin/BEND.

Paper Structure (56 sections, 4 figures, 14 tables)

This paper contains 56 sections, 4 figures, 14 tables.

Introduction
Background
Eukaryotic DNA organization and terminology
Language modeling for biomolecular sequences: From proteins to DNA
Related works
DNA language models
Supervised learning on DNA
Benchmark collections on DNA
Motivation of BEND
Tasks and Datasets
Gene finding
Enhancer annotation
Chromatin accessibility prediction
Histone modification prediction
CpG methylation prediction
...and 41 more sections

Figures (4)

Figure 1: The organization of eukaryotic genomic DNA. Numbers are indicative examples for the human genome. Genes are structured as alternating introns (average: 5,400 bp) and exons (average: 170 bp), and have a promoter regulatory element before their TSS. Enhancers can be thousands of bp away from the gene. DNA is wrapped around histone proteins and densely packed as a chromosome.
Figure A1: Length distribution of samples in the gene finding dataset.
Figure A2: Distance to main TSS distribution of the enhancer elements in the enhancer annotation dataset.
Figure A3: Length distribution of the enhancer elements in the dataset.
