TEPI: Taxonomy-aware Embedding and Pseudo-Imaging for Scarcely-labeled Zero-shot Genome Classification

Sathyanarayanan Aakur, Vishalini R. Laguduva, Priyadharsini Ramamurthy, Akhilesh Ramachandran

TL;DR

The paper tackles scalable zero-shot genome classification in the face of an enormous and imbalanced species space by introducing TEPI, a framework that combines a taxonomy-aware embedding space $\mathcal{E}$ with pseudo-image representations $I(\mathcal{G})$ of whole genomes. It builds a taxonomic graph $\mathcal{T}$ and learns embeddings via node2vec to encode phylogenetic relationships, while representing genomes as relative k-mer co-occurrence images and mapping them to $\mathcal{E}$ using a CNN-based regressor $\phi$ trained with an $L_2$ loss. In extensive experiments on 93 bacterial species with sparse labeling, TEPI-Comp achieves strong generalized zero-shot performance and substantially reduced latency compared to BLAST, highlighting the approach's scalability and practical potential. Overall, TEPI provides a principled, image-based, taxonomy-aware path to open-world genome profiling that can integrate into diagnostic pipelines and support future extensions to 16S/23S sequencing data and point-of-care applications.
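To make the pseudo-image idea more concrete, here is a minimal sketch of a relative k-mer co-occurrence matrix. It is an illustration under stated assumptions, not the authors' implementation: the function name `kmer_cooccurrence_image`, the choice of adjacent non-overlapping k-mers, the default `k`, and the skipping of ambiguous bases are all hypothetical, and the complement-aware handling mentioned for TEPI-Comp is not modeled.

```python
from itertools import product


def kmer_cooccurrence_image(sequence, k=2):
    """Return a (4**k x 4**k) nested list of relative co-occurrence frequencies.

    Entry [i][j] is the fraction of sliding windows in which k-mer i is
    immediately followed by k-mer j. This pairing rule is an assumption
    made for illustration only.
    """
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    index = {kmer: i for i, kmer in enumerate(kmers)}
    size = len(kmers)
    counts = [[0] * size for _ in range(size)]
    total = 0

    sequence = sequence.upper()
    for pos in range(len(sequence) - 2 * k + 1):
        left = sequence[pos:pos + k]
        right = sequence[pos + k:pos + 2 * k]
        # Skip windows that contain ambiguous bases such as N.
        if left in index and right in index:
            counts[index[left]][index[right]] += 1
            total += 1

    if total == 0:
        return [[0.0] * size for _ in range(size)]
    # Normalize counts so the matrix holds relative frequencies.
    return [[c / total for c in row] for row in counts]


if __name__ == "__main__":
    image = kmer_cooccurrence_image("ACGTACGTTAGC", k=2)
    print(len(image), len(image[0]))
```

Normalizing by the total pair count keeps the matrix comparable across genomes of different lengths, which is one plausible reading of "relative" co-occurrence.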

Abstract

A species' genetic code or genome encodes valuable evolutionary, biological, and phylogenetic information that aids in species recognition, taxonomic classification, and understanding genetic predispositions like drug resistance and virulence. However, the vast number of potential species poses significant challenges in developing a general-purpose whole genome classification tool. Traditional bioinformatics tools have made notable progress but lack scalability and are computationally expensive. Machine learning-based frameworks show promise but must address the issue of large classification vocabularies with long-tail distributions. In this study, we propose addressing this problem through zero-shot learning using TEPI, Taxonomy-aware Embedding and Pseudo-Imaging. We represent each genome as pseudo-images and map them to a taxonomy-aware embedding space for reasoning and classification. This embedding space captures compositional and phylogenetic relationships of species, enabling predictions in extensive search spaces. We evaluate TEPI using two rigorous zero-shot settings and demonstrate its generalization capabilities qualitatively on curated, large-scale, publicly sourced data.
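Classification in a shared embedding space can be sketched as a nearest-neighbor lookup: a genome's predicted embedding is compared against the embedding of every candidate species, including species with no labeled training genomes. The sketch below assumes Euclidean distance and uses made-up 2-D vectors; the function name `rank_species`, the distance choice, and all numeric values are hypothetical and do not come from the paper.

```python
import math


def rank_species(predicted, species_embeddings, top_k=5):
    """Rank candidate species by Euclidean distance to a predicted embedding."""
    scored = [
        (math.dist(predicted, vector), name)
        for name, vector in species_embeddings.items()
    ]
    scored.sort()
    return [(name, distance) for distance, name in scored[:top_k]]


if __name__ == "__main__":
    # Toy 2-D embeddings for illustration only (not learned values).
    toy_embeddings = {
        "Mycobacterium avium": (0.9, 0.1),
        "Mycobacterium tuberculosis": (1.0, 0.0),
        "Francisella tularensis": (-0.8, 0.5),
        "Clostridium botulinum": (0.0, -1.0),
    }
    for name, distance in rank_species((0.95, 0.02), toy_embeddings, top_k=2):
        print(f"{name}: {distance:.3f}")
```

Because every species in the taxonomy has an embedding whether or not it has labeled genomes, a lookup of this kind can return species never seen during training, which is what makes the setting zero-shot.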

Paper Structure

This paper contains 15 sections, 3 equations, 6 figures, 5 tables, 1 algorithm.

Figures (6)

  • Figure 1: Illustration of the Overall Workflow of the proposed framework. We learn a taxonomy-aware embedding space during training that is used to predict taxonomies of query genomes during testing. Dotted red lines indicate learning.
  • Figure 2: Overall architecture of the proposed TEPI model. We aim to learn a common, taxonomy-aware representation space to which each species' genome representations can be mapped to ensure generalizable, zero-shot classification performance.
  • Figure 3: Iterative construction of the hierarchical taxonomy graph to capture phylogenetic relationships between species. We iteratively add nodes based on the species taxonomy from an empty graph to capture the inherently compositional relationships.
  • Figure 4: Genome-level complement-aware, pseudo-images of genomes from (a) Mycobacterium avium, (b) Mycobacterium tuberculosis, (c) Francisella tularensis and (d) Clostridium botulinum. Note the similar patterns between the former two images, which belong to species from the same genus Mycobacterium.
  • Figure 5: Qualitative visualization of the predicted taxonomy tree from top-5 predictions (on the right) for a given query genome (left) for an unseen genus. All retrieved species belong to the same family, indicating the compositional structure of the learned embedding space. The top-1 prediction is underlined, and the correct ones are in green.
  • ...and 1 more figures
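
The iterative graph construction described in the Figure 3 caption can be sketched as follows. This is a minimal illustration, assuming each species is given as a lineage ordered from the highest rank down to the species name. The function name `build_taxonomy_graph`, the synthetic `root` node, and the truncated family/genus/species lineages are assumptions made for the example, not details from the paper.

```python
from collections import defaultdict


def build_taxonomy_graph(lineages):
    """Build an undirected taxonomy graph as an adjacency dict of sets.

    Starting from an empty graph, each lineage adds any missing nodes and
    links every taxon to its parent; shared ancestors are reused.
    """
    graph = defaultdict(set)
    for lineage in lineages:
        path = ["root", *lineage]
        for parent, child in zip(path, path[1:]):
            graph[parent].add(child)
            graph[child].add(parent)
    return dict(graph)


if __name__ == "__main__":
    # Truncated lineages (family, genus, species) for illustration only.
    lineages = [
        ("Mycobacteriaceae", "Mycobacterium", "Mycobacterium avium"),
        ("Mycobacteriaceae", "Mycobacterium", "Mycobacterium tuberculosis"),
        ("Francisellaceae", "Francisella", "Francisella tularensis"),
        ("Clostridiaceae", "Clostridium", "Clostridium botulinum"),
    ]
    graph = build_taxonomy_graph(lineages)
    print(sorted(graph["Mycobacterium"]))
```

A graph in this form could then be passed to a node2vec implementation to obtain the node embeddings; that step is omitted here because it depends on the library chosen. Note that the sketch keys nodes by name alone, so it assumes taxon names are unique across ranks.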