Interpreting Microbiome Relative Abundance Data Using Symbolic Regression

Swagatam Haldar, Christoph Stein-Thoeringer, Vadim Borisov

TL;DR

This paper applies symbolic regression (SR) to microbiome relative abundance data, with a focus on colorectal cancer, and finds that SR is reasonably competitive with traditional machine learning models in predictive performance while offering markedly better model interpretability.

Abstract

Understanding the complex interactions within the microbiome is crucial for developing effective diagnostic and therapeutic strategies. Traditional machine learning models often lack interpretability, which is essential for clinical and biological insights. This paper explores the application of symbolic regression (SR) to microbiome relative abundance data, with a focus on colorectal cancer (CRC). SR, known for its high interpretability, is compared against traditional machine learning models such as random forests and gradient-boosted decision trees. These models are evaluated on performance metrics such as F1 score and accuracy. We use data from 71 studies spanning various cohorts, comprising over 10,000 samples and 749 species features. Our results indicate that SR not only competes reasonably well in terms of predictive performance, but also excels in model interpretability. SR provides explicit mathematical expressions that offer insights into the biological relationships within the microbiome, a crucial advantage for clinical and biological interpretation. Our experiments also show that SR can help explain complex models like XGBoost via knowledge distillation. To aid reproducibility and further research, we have made the code openly available at https://github.com/swag2198/microbiome-symbolic-regression.

Paper Structure

This paper contains 15 sections, 3 equations, 4 figures, 4 tables.

Figures (4)

  • Figure 1: Visualization of an SRf expression learned using gplearn. The added custom function ifelse helped reduce its size to 15 nodes, while the SR expression (without custom functions) for the same run had 25 nodes (figure in the appendix comparing SR and SRf).
  • Figure 2: Visualization of the SRf expression obtained by fitting it on predictions by XG. The expression is a composition of the absence and absence_both functions, which ultimately implies the presence of at least one of the two identified bacteria in CRC patients. This simple expression gives the same prediction as XG (with 50 trees) in over 80% of cases while being straightforward to understand.
  • Figure 3: (Left) We see an SRf expression learned using gplearn. The added custom function ifelse helps to reduce its size down to 15 nodes while the SR expression for the same run had 25 nodes (Right).
  • Figure 4: Mean and standard deviation values for selected species, derived from the feature importance analysis using the symbolic regression model, indicate that for CRC, most identified bacteria exhibit higher relative abundance.
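
The logic described in the Figure 2 caption can be illustrated with a minimal Python sketch. The definitions of absence and absence_both below are assumptions inferred from their names, not the paper's implementation, and the sample abundances and teacher labels are made up purely for illustration. The sketch shows that composing the two functions yields "at least one of the two species is present", and how agreement with a teacher model (such as XG) could be measured.

```python
def absence(x):
    """Assumed definition: 1.0 when a relative abundance is zero, else 0.0."""
    return 1.0 if x == 0 else 0.0


def absence_both(x, y):
    """Assumed definition: 1.0 only when both relative abundances are zero."""
    return 1.0 if (x == 0 and y == 0) else 0.0


def surrogate(a, b):
    """Hypothetical distilled expression: absence(absence_both(a, b)).

    absence_both is 1.0 only when both species are missing, so negating it
    with absence gives 1.0 exactly when at least one species is present.
    """
    return absence(absence_both(a, b))


def agreement(surrogate_preds, teacher_preds):
    """Fraction of samples where the surrogate matches the teacher model."""
    matches = sum(int(s == t) for s, t in zip(surrogate_preds, teacher_preds))
    return matches / len(teacher_preds)


# Made-up relative abundances for two species across four samples.
samples = [(0.0, 0.0), (0.02, 0.0), (0.0, 0.01), (0.03, 0.04)]
# Made-up teacher predictions (1 = CRC), standing in for a model such as XG.
teacher = [0, 1, 1, 0]

preds = [int(surrogate(a, b)) for a, b in samples]
fidelity = agreement(preds, teacher)
print(preds, fidelity)
```

In a distillation setting, the agreement score plays the role of the "same prediction in over 80% of cases" figure quoted in the caption: it measures how faithfully the short expression mimics the larger model, not how accurate either model is on the true labels.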