Self-Distillation Improves DNA Sequence Inference

Tong Yu, Lei Cheng, Ruslan Khalitov, Erland Brandser Olsson, Zhirong Yang

TL;DR

FinDNA introduces a self-distillation framework for DNA sequence inference that jointly leverages within-sequence context through masked nucleotide modeling and across-sequence population signals via contrastive learning. The model uses a student-teacher architecture updated by exponential moving average, two augmented views, and a ChordMixer backbone to efficiently capture long-range dependencies. Pretrained on the human reference genome, FinDNA demonstrates consistent improvements across 20 downstream tasks spanning GenomicBenchmarks, GUE, and MTcDNA, including substantial gains over HyenaDNA and competitive performance versus larger pretraining baselines, while achieving greater parameter efficiency. This approach advances DNA sequence understanding by incorporating population-level statistics into self-supervised pretraining (SSP), with practical implications for regulatory element prediction, pathogen tracking, and cross-species genomics.
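To make the masked nucleotide modeling idea concrete, here is a minimal sketch of the input corruption step. It is an illustration only, not FinDNA's actual pipeline: the function name `mask_nucleotides`, the mask symbol `N`, and the 15% mask ratio are all assumptions chosen for the example.

```python
import random

MASK = "N"  # hypothetical mask symbol; the paper's actual mask token is not stated here


def mask_nucleotides(seq, mask_ratio=0.15, rng=None):
    """Replace a random subset of positions in `seq` with MASK.

    Returns the masked sequence and the sorted list of masked positions.
    The 0.15 default ratio is an assumption, not a value from the paper.
    """
    rng = rng or random.Random()
    if not seq:
        return seq, []
    # Mask at least one position, but never more than the sequence length.
    n_mask = min(len(seq), max(1, round(len(seq) * mask_ratio)))
    positions = sorted(rng.sample(range(len(seq)), n_mask))
    chars = list(seq)
    for i in positions:
        chars[i] = MASK
    return "".join(chars), positions


original = "ACGTACGTACGTACGTACGT"
masked, positions = mask_nucleotides(original, rng=random.Random(0))
```

During pretraining, a model would be asked to recover the original nucleotides at `positions` from the surrounding unmasked context.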

Abstract

Self-supervised pretraining (SSP) has been recognized as a method to enhance prediction accuracy in various downstream tasks. However, its efficacy for DNA sequences remains somewhat constrained. This limitation stems primarily from the fact that most existing SSP approaches in genomics focus on masked language modeling of individual sequences, neglecting the crucial aspect of encoding statistics across multiple sequences. To overcome this challenge, we introduce an innovative deep neural network model, which incorporates collaborative learning between a 'student' and a 'teacher' subnetwork. In this model, the student subnetwork employs masked learning on nucleotides and progressively adapts its parameters to the teacher subnetwork through an exponential moving average approach. Concurrently, both subnetworks engage in contrastive learning, deriving insights from two augmented representations of the input sequences. This self-distillation process enables our model to effectively assimilate both contextual information from individual sequences and distributional data across the sequence population. We validated our approach with preliminary pretraining using the human reference genome, followed by applying it to 20 downstream inference tasks. The empirical results from these experiments demonstrate that our novel method significantly boosts inference performance across the majority of these tasks. Our code is available at https://github.com/wiedersehne/FinDNA.
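The exponential moving average coupling between the two subnetworks can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it follows the common self-distillation convention in which the teacher's parameters track a moving average of the student's, parameters are represented as plain lists of floats, and the name `ema_update` and the momentum values are hypothetical.

```python
def ema_update(teacher, student, momentum=0.99):
    """Return teacher parameters moved toward the student's.

    Applies t <- momentum * t + (1 - momentum) * s element-wise.
    The update direction (teacher tracks student) and the default
    momentum are assumptions made for this sketch.
    """
    return [momentum * t + (1.0 - momentum) * s for t, s in zip(teacher, student)]


# Toy parameters: the teacher drifts toward the student over repeated updates.
teacher = [0.0, 1.0]
student = [1.0, 1.0]
for _ in range(3):
    teacher = ema_update(teacher, student, momentum=0.5)
```

A momentum close to 1 makes the averaged network change slowly, which is the usual reason for choosing it as a stable target in self-distillation setups.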

Paper Structure (21 sections, 7 equations, 1 figure, 9 tables)

Figures (1)

  • Figure 1: Illustration of our self-supervised learning model: A. pretraining and B. fine-tuning and inference.