Learning Genomic Structure from $k$-mers
Filip Thor, Carl Nettelblad
TL;DR
This work addresses the problem of reconstructing and analyzing genomes from ultra-short reads by learning a general, task-agnostic representation of $k$-mers. It introduces CReadNet, a contrastive-learning framework that embeds $k$-mers into a continuous space reflecting genomic structure. The framework uses a domain-specific augmentation scheme and a coordinate-thresholded loss that scales positives by genomic distance using a threshold $\Gamma$ and a distance weight $d_{i,p}$. A ConvNet encoder produces a 256-dimensional embedding, and downstream heads predict exact coordinates through regression, bitwise prediction, or GPT-based bit generation. The approach is demonstrated on $\textit{E. coli}$ data and extended to simulated ancient-DNA read mapping, inversion and structural-variation detection, and metagenomic identification. The results show read-mapping accuracy and runtime on par with BWA-aln for short genomes and favorable scaling of inference with genome size, suggesting practical value for metagenomics and large-genome applications; self-supervised training offers a route to analysis without a full assembly. Overall, the work provides an extensible framework for learning genomic structure directly from $k$-mers, supporting a range of downstream analyses and potentially de novo assembly from read data.
Abstract
Sequencing a genome to determine an individual's DNA produces an enormous number of short nucleotide subsequences known as reads, which must be reassembled to reconstruct the full genome. We present a method for analyzing this type of data using contrastive learning, in which an encoder model is trained to produce embeddings that cluster together sequences from the same genomic region. The sequential nature of genomic regions is preserved in the form of trajectories through this embedding space. Trained solely to reflect the structure of the genome, the resulting model provides a general representation of $k$-mer sequences, suitable for a range of downstream tasks involving read data. We apply our framework to learn the structure of the $\textit{E. coli}$ genome, and demonstrate its use in simulated ancient DNA (aDNA) read mapping and identification of structural variations. Furthermore, we illustrate the potential of using this type of model for metagenomic species identification. We show how incorporating a domain-specific noise model can enhance embedding robustness, and how a supervised contrastive learning setting can be adopted when a linear reference genome is available, by introducing a distance thresholding parameter $\Gamma$. The model can also be trained fully self-supervised on read data, enabling analysis without the need to construct a full genome assembly using specialized algorithms. Small prediction heads based on a pre-trained embedding are shown to perform on par with BWA-aln, the current gold standard approach for aDNA mapping, in terms of accuracy and runtime for short genomes. Given the method's favorable scaling properties with respect to total genome size, inference using our approach is highly promising for metagenomic applications and for mapping to genomes comparable in size to the human genome.
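To make the role of the distance threshold concrete, the following is a minimal sketch of how positive pairs might be selected from reference coordinates in the supervised setting. It assumes that two $k$-mers count as positives when their start coordinates on a linear reference differ by at most $\Gamma$. The function names `extract_kmers` and `positive_mask` are hypothetical, the toy genome is made up for illustration, and the distance weight $d_{i,p}$ and the contrastive loss itself are not reproduced here.

```python
def extract_kmers(genome: str, k: int):
    """Return (k-mer, start coordinate) pairs for every window of length k."""
    return [(genome[i:i + k], i) for i in range(len(genome) - k + 1)]


def positive_mask(coords, gamma):
    """Build a boolean matrix of positive pairs.

    Assumption for illustration: mask[i][p] is True when i != p and the
    start coordinates of k-mers i and p differ by at most gamma bases
    on a linear reference.
    """
    n = len(coords)
    return [
        [i != p and abs(coords[i] - coords[p]) <= gamma for p in range(n)]
        for i in range(n)
    ]


# Toy example: a 10-base "genome" split into 4-mers, with gamma = 2.
genome = "ACGTACGTAC"
kmers = extract_kmers(genome, 4)
coords = [start for _, start in kmers]
mask = positive_mask(coords, gamma=2)

# Number of positives per anchor; anchors near the ends have fewer neighbours.
print([sum(row) for row in mask])
```

In this sketch a larger `gamma` makes more distant $k$-mers count as positives for the same anchor, which is one plausible reading of how the threshold controls the granularity of "same genomic region".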
