
TPMM: Three-component Posterior Mixture Model Enables Robust Inverton Detection in Low-Depth Metagenomes and Suggests Potential Viral Invertons

Yi Lu, Jiaojiao Guan, Yang Shen, Jiayu Shang, Yanni Sun

Abstract

Bacterial phase variation enables reversible, locus-specific phenotypic switching, often driven by DNA inversion (invertons). To identify these events, researchers commonly rely on sequencing reads that provide orientation-specific support. Metagenomic sequencing, which captures total genetic material independent of cultivation, offers a powerful platform for the comprehensive study of invertons. However, computational inverton calling from metagenomic data is difficult at low sequencing depth: hard read-support cutoffs can miss true events, while sequence-only predictors lack read-backed interpretability and uncertainty quantification. To address this, we present TPMM, a three-component posterior mixture model for inverton calling in metagenomic data. TPMM explicitly incorporates sequencing depth to formulate inverton detection as a probabilistic mixture problem. Starting from candidates flanked by inverted repeats, the model classifies the candidates into noise, low-probability, or high-probability inversion signals using read evidence. Finally, TPMM assigns posterior probabilities as soft labels and applies cumulative Bayesian False Discovery Rate control to robustly identify true invertons. On two real gut metagenomic datasets, TPMM agrees well with PhaseFinder at high depth but recovers substantially more invertons under systematic downsampling, demonstrating superior performance in sparse-data regimes. We further examine potential reversible inversion elements in viral genomes and provide supporting analyses, suggesting a broader scope for inversion-mediated regulation.

Paper Structure (14 sections, 8 equations, 5 figures, 1 table)


Figures (5)

  • Figure 1: Overview of the pipeline. (A) Acquisition of read support evidence. PhaseFinder is used to count the reads supporting the forward and reverse orientations. (B) Statistical inference and false discovery control. After initial thresholding of read evidence, the data are analyzed using a Three-component Posterior Mixture Model (TPMM) to estimate posterior probabilities for each locus via the Expectation-Maximization (EM) algorithm. Candidate invertons are ranked in ascending order of their posterior probability of being noise ($P_{\text{noise}}$) to calculate the cumulative Bayesian False Discovery Rate (BFDR). The final set of invertons is identified by selecting the top $K$ candidates such that $\text{BFDR}(K) \leq \alpha$, where $\alpha$ is a user-defined threshold.
  • Figure 2: Heatmap of read evidence and posterior for selected samples from the human gut dataset. For each bin, we plot the median $P_{\mathrm{true}}$ across candidates falling into that bin. Bins with no observations are left blank. The yellow star indicates the bin with the largest positive enrichment (i.e., where TPMM yields the greatest excess of unique positives), whereas the purple star indicates the bin with the largest negative enrichment (i.e., where PhaseFinder yields the greatest excess of unique positives).
  • Figure 3: Composition of $I_{\mathrm{ref}}$ calls recovered under downsampling. DS20, DS40, and DS80 denote downsampled datasets retaining 20%, 40%, and 80% of the original reads, respectively. Left: Human dataset; Right: Rat dataset. Stacked bars represent the proportion of $I_{\mathrm{ref}}$ identified by both methods (common; grey), detected only by TPMM (red), and detected only by PhaseFinder (blue).
  • Figure 4: Enrichment of downstream coding sequences near identified invertons. The observed hit rate ($T_{\mathrm{obs}}$, black line) is compared to a null distribution generated by contig-preserving permutation. The dark blue line represents the mean of the null distribution, and the light blue shaded area indicates the 95% confidence interval. (A) Analysis of the human gut dataset. (B) Analysis of the rat gut dataset. In both datasets, the observed proximity of invertons to downstream CDS start sites significantly exceeds the null expectation, particularly at small window sizes.
  • Figure 5: Genomic context and impact of identified invertons. (A) Schematic representation of coding sequences (CDSs) located within 1,000 bp downstream of or directly spanning the inverton region. Genes are color-coded by predicted function, including hypothetical proteins, major capsid protein (MCP), DNA polymerase, peptidase, and TerD domain-containing proteins. White rectangles with diagonal stripes indicate inverton sequences. (B) Representative example of a Microvirus inverton impacting the major capsid protein (MCP) gene.
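
The selection step described in the Figure 1(B) caption can be illustrated with a minimal sketch. It assumes the common definition of cumulative BFDR as the running mean of the sorted noise posteriors, $\text{BFDR}(K) = \frac{1}{K}\sum_{i=1}^{K} P_{\text{noise},(i)}$; the paper's own equation is not reproduced in this listing, so this form is an assumption. The function name `select_by_bfdr` and the example values are hypothetical and are not taken from the TPMM code.

```python
from typing import List, Sequence


def select_by_bfdr(p_noise: Sequence[float], alpha: float = 0.05) -> List[int]:
    """Return the indices of candidates kept under cumulative BFDR <= alpha.

    Assumption: BFDR(K) is the mean of the K smallest P_noise values.
    """
    # Rank candidates by ascending posterior probability of being noise.
    order = sorted(range(len(p_noise)), key=lambda i: p_noise[i])

    selected: List[int] = []
    running_sum = 0.0
    for rank, idx in enumerate(order, start=1):
        running_sum += p_noise[idx]
        # The running mean of an ascending sequence never decreases,
        # so the first violation marks the end of the accepted prefix.
        if running_sum / rank > alpha:
            break
        selected.append(idx)
    return selected


if __name__ == "__main__":
    # Hypothetical posteriors for five candidate loci.
    posteriors = [0.01, 0.9, 0.02, 0.2, 0.03]
    print(select_by_bfdr(posteriors, alpha=0.05))
```

Because the candidates are sorted in ascending order of $P_{\text{noise}}$, the accepted set is always a prefix of the ranking, so stopping at the first violation returns the largest admissible $K$.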