
Pretrained-Guided Conditional Diffusion Models for Microbiome Data Analysis

Xinyuan Shi, Fangfang Zhu, Wenwen Min

TL;DR

This work tackles missing microbiome data in cancer cohorts by introducing mbVDiT, a masked conditional diffusion model that operates in a VAE-derived latent space and is guided by patient metadata. By combining a pretrained VAE with a score-based diffusion process and cross-attention conditioning, mbVDiT improves imputation accuracy while preserving data distributions and inter-microbe relationships. Across three TCGA cancer datasets, mbVDiT outperforms multiple baselines on PCC, cosine similarity, RMSE, and MAE, and remains robust under high sparsity. Ablation studies show the value of metadata conditioning and VAE pretraining, suggesting practical utility for integrating diverse microbiome datasets in imputation tasks.
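The four evaluation metrics named above can be illustrated with a minimal sketch. It assumes the metrics are pooled over all held-out (masked) entries; the paper may instead compute them per sample or per microbe. The function name `imputation_metrics` and its arguments are hypothetical and are not taken from the authors' code.

```python
import numpy as np

def imputation_metrics(truth, imputed, mask):
    """Score imputed values against ground truth on masked entries only.

    truth, imputed: arrays of the same shape (samples x microbes).
    mask: boolean array, True where an entry was hidden and then imputed.
    Assumption: all masked entries are pooled into one vector before scoring.
    """
    t = np.asarray(truth, dtype=float)[mask]
    p = np.asarray(imputed, dtype=float)[mask]
    err = p - t
    pcc = float(np.corrcoef(t, p)[0, 1])                        # Pearson correlation
    cos = float(t @ p / (np.linalg.norm(t) * np.linalg.norm(p)))  # cosine similarity
    rmse = float(np.sqrt(np.mean(err ** 2)))                    # root mean squared error
    mae = float(np.mean(np.abs(err)))                           # mean absolute error
    return {"PCC": pcc, "Cosine": cos, "RMSE": rmse, "MAE": mae}
```

Higher PCC and cosine similarity and lower RMSE and MAE indicate better imputation. Restricting the comparison to masked entries keeps observed values, which the model sees as conditioning input, from inflating the scores.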

Abstract

Emerging evidence indicates that human cancers are closely linked to the human microbiome. However, sample sizes are limited and substantial data are lost during collection, so several machine learning methods have been proposed to address missing data. These methods have not fully utilized the known clinical information of patients to improve imputation accuracy. Therefore, we introduce mbVDiT, a novel pre-trained conditional diffusion model for microbiome data imputation and denoising, which uses the unmasked data and patient metadata as conditional guidance for imputing missing values. It also uses a VAE to integrate other public microbiome datasets to enhance model performance. Results on microbiome datasets from three different cancer types demonstrate the performance of our method in comparison with existing methods.

Paper Structure

This paper contains 25 sections, 14 equations, 5 figures, 6 tables.

Figures (5)

  • Figure 1: The architecture of the proposed mbVDiT, in which a VAE-based pre-trained model integrates other public microbiome datasets to enhance model performance. (A) Training process: noise is added to the masked latent-space data $\hat{x}_{0}$ to obtain $\hat{x}_{T}$. A denoising network $f_{\theta}$ is then trained to predict the noise, conditioned on the unmasked data and patient metadata. (B) Inference process: the trained denoising network $f_{\theta}$ denoises the random noise $x_{T}$ step by step under the guidance of the condition. The denoised data is then passed through a decoder to produce the predicted imputation of the microbiome data.
  • Figure 2: Visualization of the imputation performance of various baseline methods.
  • Figure 3: Comparison of alpha-diversity based on Shannon index measured from real data and the imputed data by different methods.
  • Figure 4: Visualization of microbe-microbe Pearson correlation coefficients computed on the imputed data.
  • Figure 5: (A) Heatmap of Pearson correlation coefficients between imputed and real microbiome data. (B) Box plot of Pearson correlation coefficients between imputed microbiome data and real microbiome data.
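
The training process described in Figure 1(A) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes a DDPM-style closed-form forward process with a linear beta schedule, noise applied only at masked latent positions, and a mean squared error on the predicted noise restricted to those positions. The names `make_alpha_bar`, `forward_noise`, and `masked_noise_loss` are hypothetical, and the denoising network $f_{\theta}$ and its conditioning are omitted.

```python
import numpy as np

def make_alpha_bar(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative product of (1 - beta_t) for an assumed linear beta schedule."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.cumprod(1.0 - betas)

def forward_noise(x0, mask, t, alpha_bar, rng):
    """Noise the latent x0 at step t, only where mask is True.

    Assumption: unmasked (observed) latent entries stay clean so they can
    serve as conditioning input to the denoiser.
    """
    eps = rng.standard_normal(x0.shape)
    a = alpha_bar[t]
    x_t = np.sqrt(a) * x0 + np.sqrt(1.0 - a) * eps  # closed-form q(x_t | x_0)
    x_t = np.where(mask, x_t, x0)                   # keep observed entries unchanged
    return x_t, eps

def masked_noise_loss(eps_pred, eps, mask):
    """Mean squared error between predicted and true noise on masked entries."""
    diff = (eps_pred - eps)[mask]
    return float(np.mean(diff ** 2))
```

In a full training loop, a network standing in for $f_{\theta}$ would receive `x_t`, the step `t`, and the condition, and its output would be passed to `masked_noise_loss` as `eps_pred`.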