Cross-Species Antimicrobial Resistance Prediction from Genomic Foundation Models

Huilin Tai

Abstract

Cross-species antimicrobial resistance (AMR) prediction is fundamentally an out-of-distribution (OOD) generalization problem: models trained on one set of bacterial taxa must transfer to phylogenetically distinct genomes that may rely on different resistance mechanisms. Across species, resistance arises from a heterogeneous mixture of localized, horizontally transferred gene cassettes and diffuse species-specific genomic backgrounds, making successful transfer inherently mechanism-dependent. Using a strict species holdout protocol, we first establish an interpretable k-mer baseline with Kover and show that strong within-species performance collapses under true cross-species evaluation. This motivates representation-level approaches that preserve transferable biological signals rather than amplify phylogenetic shortcuts. We investigate genomic foundation model embeddings derived from Evo-1-8k-base and introduce diagnostics for layer selection based on activation scale, isotropy, effective rank, and cross-seed stability under native bfloat16 inference. These analyses identify a stability boundary in deeper layers and reveal that embeddings extracted near this boundary provide more robust representations for downstream prediction. To preserve localized resistance signals, we treat per-window embeddings as an ordered multivariate signal and apply MiniRocket to summarize multi-scale local activation patterns instead of relying on global pooling. Our results show that aggregation strategy plays a central role in cross-species AMR prediction and that preserving local activation patterns substantially improves generalization when resistance mechanisms are localized.
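
As an illustration of what a strict species holdout implies, the sketch below splits sample indices so that no species contributes genomes to both the training and the evaluation side. It is a simplified two-way split, not the paper's five-partition protocol; the function name `species_holdout_split`, the `holdout_fraction` parameter, and the example species labels are assumptions introduced here.

```python
import random
from collections import defaultdict


def species_holdout_split(species_labels, holdout_fraction=0.2, seed=0):
    """Split sample indices so that whole species are held out.

    Hypothetical helper: the paper's actual partitioning scheme is not
    reproduced here, only the "no species on both sides" constraint.
    """
    # Group sample indices by their species label.
    by_species = defaultdict(list)
    for idx, sp in enumerate(species_labels):
        by_species[sp].append(idx)

    # Shuffle species (not samples) so the split is made at the taxon level.
    species = sorted(by_species)
    rng = random.Random(seed)
    rng.shuffle(species)

    # Hold out at least one species.
    n_holdout = max(1, round(len(species) * holdout_fraction))
    held_out = set(species[:n_holdout])

    train_idx = sorted(i for sp in species if sp not in held_out for i in by_species[sp])
    test_idx = sorted(i for sp in held_out for i in by_species[sp])
    return train_idx, test_idx, held_out


# Toy labels for illustration only.
labels = (
    ["E. coli"] * 3
    + ["K. pneumoniae"] * 2
    + ["S. aureus"] * 2
    + ["P. aeruginosa"]
    + ["A. baumannii"] * 2
)
train_idx, test_idx, held_out = species_holdout_split(labels, holdout_fraction=0.4, seed=1)
print("held-out species:", sorted(held_out))
print("train size:", len(train_idx), "test size:", len(test_idx))
```

Splitting at the species level rather than the sample level is what turns the evaluation into an out-of-distribution test: a random sample-level split would let closely related genomes leak across the boundary.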

Paper Structure

This paper contains 77 sections, 2 equations, 16 figures, 2 tables, and 2 algorithms.

Figures (16)

  • Figure 1: Category coverage among the top 30 targets, before (left) and after (right) filtering. The raw data shows extreme imbalance, with many antibiotics lacking sufficient species diversity. Post-filtering retains only targets that preserve diversity across all five partitions.
  • Figure 2: Sample counts among the top 30 targets, before (left) and after (right) filtering. Many antibiotics in the raw data have insufficient samples for stable evaluation. Post-filter sets maintain adequate per-partition counts to yield reliable confidence intervals.
  • Figure 3: Label prevalence for the six retained antibiotics by partition across three replicates. Despite filtering for species diversity and sample size, natural class imbalance persists, ranging from 5% to 65% resistance rates. Replicate stability confirms reproducible split generation.
  • Figure 4: Cross-species degradation in Kover. F1 across five partitions for six antibiotics (three runs where available). Green bands indicate same-species evaluation; pink bands indicate cross-species evaluation. The dashed line marks the transition to val_outside. Degradation and variance are drug dependent; tigecycline fails under extreme imbalance.
  • Figure 5: Isotropy by depth. Angular diversity increases through mid-layers, peaks at L9–L10, and collapses at L11 (ten seeds; min–max bands).
  • ...and 11 more figures
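
The layer diagnostics named in the abstract and in Figure 5 (activation scale, isotropy, effective rank) can be sketched with NumPy as follows. The definitions used here are common choices and are assumptions, not necessarily the paper's: effective rank is taken as the exponential of the entropy of the normalized singular values, and isotropy is approximated as angular diversity, one minus the mean pairwise cosine similarity. The function names and the synthetic inputs are likewise hypothetical. Embeddings produced under bfloat16 inference are assumed to have been cast to a wider float type before these computations.

```python
import numpy as np


def activation_scale(X):
    """Mean L2 norm of the embedding rows (assumed definition)."""
    X = np.asarray(X, dtype=np.float64)
    return float(np.linalg.norm(X, axis=1).mean())


def effective_rank(X):
    """exp(entropy) of the normalized singular values of the centered matrix."""
    X = np.asarray(X, dtype=np.float64)
    Xc = X - X.mean(axis=0, keepdims=True)
    s = np.linalg.svd(Xc, compute_uv=False)
    total = s.sum()
    if total == 0.0:
        return 0.0  # degenerate case: all rows identical
    p = s / total
    p = p[p > 0]  # drop exact zeros so log(p) is defined
    return float(np.exp(-(p * np.log(p)).sum()))


def angular_diversity(X):
    """1 - mean pairwise cosine similarity over distinct rows (assumed proxy for isotropy)."""
    X = np.asarray(X, dtype=np.float64)
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    U = X / np.clip(norms, 1e-12, None)
    cos = U @ U.T
    n = U.shape[0]
    mean_off_diag = (cos.sum() - np.trace(cos)) / (n * (n - 1))
    return float(1.0 - mean_off_diag)


# Synthetic stand-ins for one layer's per-window embeddings (windows x hidden dim).
rng = np.random.default_rng(0)
spread = rng.normal(size=(200, 16))                    # directions spread out
collapsed = np.outer(rng.normal(size=200) + 5.0,       # every row is a positive
                     rng.normal(size=16))              # multiple of one vector

for name, X in [("spread", spread), ("collapsed", collapsed)]:
    print(name, activation_scale(X), effective_rank(X), angular_diversity(X))
```

Computing these per layer and per seed, then looking for the depth at which angular diversity or effective rank drops sharply, is one way to locate the kind of collapse the Figure 5 caption describes at L11.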