Learning Genomic Structure from $k$-mers

Filip Thor, Carl Nettelblad

TL;DR

This work addresses the problem of reconstructing and analyzing genomes from ultra-short reads by learning a general, task-agnostic representation of $k$-mers. It introduces CReadNet, a contrastive-learning framework that embeds $k$-mers into a continuous space reflecting genomic structure, with a domain-specific augmentation scheme and a coordinate-thresholded loss that scales positives by genomic distance via a threshold $\Gamma$ and a distance weight $d_{i,p}$. The approach yields a 256-dimensional embedding from a ConvNet encoder, with downstream heads that can predict exact coordinates through regression, bitwise prediction, or GPT-based bit generation; it is demonstrated on $\textit{E. coli}$ data and extended to ancient-DNA read mapping, inversion/structural-variation detection, and metagenomic identification. The results show competitive read-mapping accuracy and scalable inference, suggesting practical impact for metagenomics and large-genome applications, while self-supervised training offers a route to analysis without full assemblies. Overall, the work provides a robust, extensible framework to learn genomic structure directly from $k$-mers, enabling efficient, versatile downstream analyses and potential de novo assembly approaches from read data.

Abstract

Sequencing a genome to determine an individual's DNA produces an enormous number of short nucleotide subsequences known as reads, which must be reassembled to reconstruct the full genome. We present a method for analyzing this type of data using contrastive learning, in which an encoder model is trained to produce embeddings that cluster together sequences from the same genomic region. The sequential nature of genomic regions is preserved in the form of trajectories through this embedding space. Trained solely to reflect the structure of the genome, the resulting model provides a general representation of $k$-mer sequences, suitable for a range of downstream tasks involving read data. We apply our framework to learn the structure of the $\textit{E. coli}$ genome, and demonstrate its use in simulated ancient DNA (aDNA) read mapping and identification of structural variations. Furthermore, we illustrate the potential of using this type of model for metagenomic species identification. We show how incorporating a domain-specific noise model can enhance embedding robustness, and how a supervised contrastive learning setting can be adopted when a linear reference genome is available, by introducing a distance thresholding parameter $\Gamma$. The model can also be trained fully self-supervised on read data, enabling analysis without the need to construct a full genome assembly using specialized algorithms. Small prediction heads based on a pre-trained embedding are shown to perform on par with BWA-aln, the current gold standard approach for aDNA mapping, in terms of accuracy and runtime for short genomes. Given the method's favorable scaling properties with respect to total genome size, inference using our approach is highly promising for metagenomic applications and for mapping to genomes comparable in size to the human genome.
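
To make the role of the distance threshold $\Gamma$ concrete, the following is a minimal sketch of how positives could be selected in a batch when reference coordinates are known. It is an illustration only: the function name `positive_mask`, the inclusive comparison, and the use of plain absolute distance on a linear coordinate axis are assumptions, not details taken from the paper.

```python
import numpy as np

def positive_mask(coords, gamma):
    """Mark pairs (i, p) whose genomic coordinates differ by at most gamma.

    Hypothetical helper: the inclusive threshold and the linear
    (non-circular) distance are assumptions made for this sketch.
    """
    coords = np.asarray(coords)
    # Pairwise absolute coordinate distance via broadcasting.
    dist = np.abs(coords[:, None] - coords[None, :])
    mask = dist <= gamma
    # A sample is never treated as its own positive.
    np.fill_diagonal(mask, False)
    return mask

# Four k-mers with start coordinates in bp; the last one lies far away.
coords = [0, 50, 120, 5000]
mask = positive_mask(coords, gamma=100)
```

With a larger `gamma`, more pairs count as positives, so neighbouring regions are pulled together more strongly; with a smaller `gamma`, only near-overlapping $k$-mers are.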

Paper Structure

This paper contains 28 sections, 10 equations, 11 figures, 5 tables.

Figures (11)

  • Figure 1: Given a starting sequence, we draw two $k$-mers, and $\mathrm{Aug}(\cdot)$ adds aDNA noise, resulting in the positive pair of $k$-mers $x_i$ and $x_j$. A convolutional encoder model $f$ and a single projection layer $g$ are trained to minimize the embedding distance between $z_i$ and $z_j$. The intermediate representation $h_i$ is used by a prediction head $P$ to predict the $k$-mer coordinate $c_i$.
  • Figure 2: Reads from the same position in the genome can be expressed differently depending on which strand the sequence originates from; for example, [TGCGTGG] and [CCACGCA] have the same bp coordinate.
  • Figure 3: Residual block layout.
  • Figure 4: Different prediction heads for the coordinate prediction task. (a) An MLP using regression to predict the $k$-mer position $\tilde{c}(x)$. (b) An MLP predicting the bitwise probabilities using the representation $h$. (c) A small GPT predicting probabilities using $h$ and the previous bits.
  • Figure 5: 2D PCA of the resulting embedding over a 20 kbp window, with $\Gamma = 1\,000$ (left) and $\Gamma = 100$ (middle), and the distribution of the mean distance to the 10 closest neighbors for each sample (right).
  • ...and 6 more figures
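
Two ideas from the captions above can be sketched in a few lines of Python. The first is the strand ambiguity in Figure 2: the two example sequences are reverse complements of each other. The second is the bitwise target used by the heads in Figure 4 (b) and (c), where a coordinate is written as a fixed-length string of bits. The function names and the most-significant-bit-first ordering are assumptions made for illustration, not the paper's implementation.

```python
# Translation table mapping each base to its complement.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Return the sequence as read from the opposite strand."""
    return seq.translate(_COMPLEMENT)[::-1]

def coordinate_to_bits(coord, n_bits):
    """Encode a non-negative coordinate as n_bits bits, MSB first (assumed order)."""
    if coord < 0 or coord >= 1 << n_bits:
        raise ValueError("coordinate does not fit in n_bits")
    return [(coord >> b) & 1 for b in reversed(range(n_bits))]

def bits_to_coordinate(bits):
    """Decode an MSB-first bit list back into an integer coordinate."""
    value = 0
    for bit in bits:
        value = (value << 1) | bit
    return value

# The pair from Figure 2: both strings describe the same genomic position.
rc = reverse_complement("TGCGTGG")

# Round trip of a small coordinate through the bit representation.
bits = coordinate_to_bits(5, n_bits=4)
decoded = bits_to_coordinate(bits)
```

A bitwise head only needs enough output bits to cover the genome length: 23 bits address positions up to 8,388,607, which would cover a genome of roughly 4.6 Mbp such as $\textit{E. coli}$ K-12.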