DNABERT-S: Pioneering Species Differentiation with Species-Aware DNA Embeddings

Zhihan Zhou, Weimin Wu, Harrison Ho, Jiayi Wang, Lizhen Shi, Ramana V Davuluri, Zhong Wang, Han Liu

TL;DR

This paper introduces DNABERT-S, a tailored genome model that develops species-aware embeddings to naturally cluster and segregate DNA sequences of different species in the embedding space. It also proposes Manifold Instance Mixup (MI-Mix), a contrastive objective that mixes the hidden representations of DNA sequences at randomly selected layers and trains the model to recognize and differentiate these mixed proportions at the output layer.

Abstract

We introduce DNABERT-S, a tailored genome model that develops species-aware embeddings to naturally cluster and segregate DNA sequences of different species in the embedding space. Differentiating species from genomic sequences (i.e., DNA and RNA) is vital yet challenging, since many real-world species remain uncharacterized and lack known reference genomes. Embedding-based methods are therefore used to differentiate species in an unsupervised manner. DNABERT-S builds upon a pre-trained genome foundation model named DNABERT-2. To encourage effective embeddings for error-prone long-read DNA sequences, we introduce Manifold Instance Mixup (MI-Mix), a contrastive objective that mixes the hidden representations of DNA sequences at randomly selected layers and trains the model to recognize and differentiate these mixed proportions at the output layer. We further enhance it with the proposed Curriculum Contrastive Learning (C$^2$LR) strategy. Empirical results on 23 diverse datasets show DNABERT-S's effectiveness, especially in realistic label-scarce scenarios. For example, it identifies twice as many species from a mixture of unlabeled genomic sequences, doubles the Adjusted Rand Index (ARI) in species clustering, and, with just 2-shot training, outperforms the top baseline's 10-shot species classification performance. Model, code, and data are publicly available at https://github.com/MAGICS-LAB/DNABERT_S.
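The abstract reports clustering quality as the Adjusted Rand Index (ARI). As a reminder of what that number measures, here is a minimal pure-Python sketch of the standard ARI formula; it is not taken from the DNABERT-S codebase, and the function name `adjusted_rand_index` is chosen here for illustration only. ARI counts pairs of sequences that two partitions group together, then corrects for the agreement expected by chance, so 1.0 means identical partitions and values near 0 mean chance-level agreement.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """ARI between two flat clusterings of the same items (illustrative helper)."""
    if len(labels_true) != len(labels_pred):
        raise ValueError("label lists must have the same length")
    n = len(labels_true)
    if n < 2:
        return 1.0  # no pairs to compare; treat as perfect agreement

    # Contingency counts: items sharing a (true species, predicted cluster) pair.
    cell_counts = Counter(zip(labels_true, labels_pred))
    true_counts = Counter(labels_true)
    pred_counts = Counter(labels_pred)

    # Number of item pairs grouped together in each view.
    sum_cells = sum(comb(c, 2) for c in cell_counts.values())
    sum_true = sum(comb(c, 2) for c in true_counts.values())
    sum_pred = sum(comb(c, 2) for c in pred_counts.values())

    expected = sum_true * sum_pred / comb(n, 2)  # chance-level agreement
    max_index = (sum_true + sum_pred) / 2
    if max_index == expected:
        return 1.0  # degenerate case: both partitions are trivial
    return (sum_cells - expected) / (max_index - expected)
```

Because the score depends only on which items are grouped together, cluster IDs do not need to match species IDs: `adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])` describes the same partition under different names and is scored as a perfect match.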

Paper Structure (31 sections, 7 equations, 9 figures, 12 tables, 1 algorithm)

This paper contains 31 sections, 7 equations, 9 figures, 12 tables, 1 algorithm.

Figures (9)

  • Figure 1: t-SNE visualization of the DNA embeddings generated by different methods on a CAMI2 dataset with $50$ different species. Each point represents an individual DNA sequence, and its color indicates the species it belongs to. Notably, DNABERT-S demonstrates a pronounced ability to cluster and segregate different species within the embedding space.
  • Figure 2: Overview of DNABERT-S's training process. We construct training data from massive reference genomes and train DNABERT-S with the proposed Curriculum Contrastive Learning (C$^2$LR) strategy, which progressively provides more challenging contrastive anchors to the model in two different phases. We propose the Manifold Instance Mixup (MI-Mix) objective, which mixes the intermediate hidden states of different inputs to construct more challenging contrastive anchors.
  • Figure 3: Metagenomics Binning Results. The bin size represents the number of unique species identified by each model and different colors represent the F1 score of the identified species. With high F1 scores, DNABERT-S identifies many more species than the baselines.
  • Figure 4: Model performance on species classification with varying numbers of training samples on $6$ datasets. Results on the other $6$ datasets are consistent and are presented in Figure 5.
  • Figure 5: Results of species classification on the other $6$ datasets.
  • ...and 4 more figures
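
The Manifold Instance Mixup (MI-Mix) objective named in the Figure 2 caption can be illustrated with a toy NumPy sketch. This is not the authors' implementation. The encoder is a hypothetical stack of `tanh` layers standing in for DNABERT-2, and the Beta-distributed mixing weight (`alpha`), the temperature (`tau`), and the cosine-similarity contrastive loss are assumptions borrowed from common mixup and contrastive-learning practice. The sketch mixes each anchor's hidden state with that of a shuffled partner at a randomly chosen layer, then asks the output to match both the anchor's own positive and the partner's positive in the mixed proportions.

```python
import numpy as np


def encode(x, weights, mix_layer=None, perm=None, lam=1.0):
    """Toy encoder (stand-in for a transformer): h <- tanh(h @ W) per layer.

    If `mix_layer` is set, hidden states entering that layer are interpolated
    with those of the shuffled batch `h[perm]` using weight `lam`.
    """
    h = x
    for layer_idx, w in enumerate(weights):
        if layer_idx == mix_layer:
            h = lam * h + (1.0 - lam) * h[perm]
        h = np.tanh(h @ w)
    return h


def log_softmax(logits):
    """Row-wise log-softmax, shifted for numerical stability."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))


def mixed_contrastive_loss(anchors, positives, weights, rng, alpha=1.0, tau=0.1):
    """Contrastive loss whose targets are mixed in the same ratio as the anchors."""
    batch = anchors.shape[0]
    mix_layer = int(rng.integers(len(weights)))  # randomly selected layer
    lam = float(rng.beta(alpha, alpha))          # mixing proportion (assumed Beta prior)
    perm = rng.permutation(batch)                # mixing partner for each anchor

    z_anchor = encode(anchors, weights, mix_layer, perm, lam)
    z_pos = encode(positives, weights)           # positives are left unmixed

    # Cosine similarity between every mixed anchor and every positive.
    z_anchor = z_anchor / (np.linalg.norm(z_anchor, axis=1, keepdims=True) + 1e-12)
    z_pos = z_pos / (np.linalg.norm(z_pos, axis=1, keepdims=True) + 1e-12)
    log_probs = log_softmax(z_anchor @ z_pos.T / tau)

    # Soft targets: weight lam on the anchor's own positive, 1 - lam on its partner's.
    idx = np.arange(batch)
    loss = -(lam * log_probs[idx, idx] + (1.0 - lam) * log_probs[idx, perm]).mean()
    return float(loss), lam, mix_layer
```

To try it, build a few random weight matrices, for example `weights = [rng.normal(size=(8, 8)) for _ in range(3)]` with `rng = np.random.default_rng(0)`, and pass a batch of anchor vectors with slightly perturbed copies as positives. In a real model the anchor and its positive would be two DNA segments from the same species, and the loss would be backpropagated through the encoder.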