UncertainGen: Uncertainty-Aware Representations of DNA Sequences for Metagenomic Binning
Abdulkadir Celikkanat, Andres R. Masegosa, Mads Albertsen, Thomas D. Nielsen
TL;DR
This work tackles metagenomic binning under sequence-level uncertainty, arguing that deterministic representations struggle to separate fragments that belong to multiple genomes or have highly similar features. It proposes UncertainGen, a probabilistic embedding that represents each DNA fragment as a Gaussian with mean $\boldsymbol{\mu}$ and covariance $\mathbf{S}$, enabling an expanded latent space and a closed-form probabilistic similarity score $q$ derived from $K_{ij} = \alpha(\mathbf{S}_i + \mathbf{S}_j)$. The approach is trained with a contrastive objective and backed by theoretical results on distinguishability and expressivity, including $\epsilon$-distinguishability and packing-number bounds. Empirically, UncertainGen delivers improvements over deterministic k-mer and LLM-based embeddings on real metagenomic datasets, producing robust, uncertainty-aware partitions with a lightweight, scalable architecture. This probabilistic framework enhances binning robustness and opens avenues for uncertainty-aware genomic analyses and future extensions beyond Gaussian representations.
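To make the idea of a covariance-dependent similarity concrete, the following is a minimal sketch and not the paper's definition of $q$. The summary above gives only $K_{ij} = \alpha(\mathbf{S}_i + \mathbf{S}_j)$; the exponential Mahalanobis form, the diagonal covariances, the default value of `alpha`, and the function name `gaussian_overlap_score` are all assumptions made for illustration.

```python
import numpy as np

def gaussian_overlap_score(mu_i, s_i, mu_j, s_j, alpha=1.0):
    """Illustrative similarity between two diagonal-Gaussian embeddings.

    mu_i, mu_j: mean vectors of the two fragments.
    s_i, s_j:   diagonal entries of their covariances (must be positive).
    alpha:      scale on the summed covariance (hypothetical default).
    """
    mu_i = np.asarray(mu_i, dtype=float)
    mu_j = np.asarray(mu_j, dtype=float)
    # Diagonal of K_ij = alpha * (S_i + S_j), as stated in the summary.
    k_diag = alpha * (np.asarray(s_i, dtype=float) + np.asarray(s_j, dtype=float))
    diff = mu_i - mu_j
    # Squared Mahalanobis distance under K_ij. The metric changes per pair
    # because K_ij depends on both fragments' covariances.
    maha_sq = np.sum(diff * diff / k_diag)
    # ASSUMPTION: map the distance to (0, 1] with a Gaussian-style kernel.
    return float(np.exp(-0.5 * maha_sq))
```

Under this assumed form, two fragments with the same means score 1, and for a fixed gap between means the score rises as either fragment's covariance grows, which is one way uncertain fragments can remain compatible with several bins.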
Abstract
Metagenomic binning aims to cluster DNA fragments from mixed microbial samples into their respective genomes, a critical step for downstream analyses of microbial communities. Existing methods rely on deterministic representations, such as k-mer profiles or embeddings from large language models, which fail to capture the uncertainty inherent in DNA sequences arising from inter-species DNA sharing and from fragments with highly similar representations. We present UncertainGen, the first probabilistic embedding approach for metagenomic binning, which represents each DNA fragment as a probability distribution in latent space. Our approach naturally models sequence-level uncertainty, and we provide theoretical guarantees on embedding distinguishability. The probabilistic embedding framework expands the feasible latent space by introducing a data-adaptive metric, which in turn enables more flexible separation of bins/clusters. Experiments on real metagenomic datasets demonstrate improvements over deterministic k-mer and LLM-based embeddings on the binning task, while offering a scalable and lightweight solution for large-scale metagenomic analysis.
