Human Limits in Machine Learning: Prediction of Plant Phenotypes Using Soil Microbiome Data

Rosa Aghdam, Xudong Tang, Shan Shan, Richard Lankau, Claudia Solís-Lemus

TL;DR

This study evaluates how well machine learning can link soil microbiome and environmental properties to potato plant phenotypes, using Random Forest and Bayesian Neural Network models. It demonstrates that accurate human labels are crucial for predictive success: diseases such as pitted scab can be forecast from microbiome data, while yield prediction is hampered by label quality and binarization. The analysis shows that data preprocessing choices and feature-selection strategies strongly influence performance, and it provides a full model selection decision tree to guide practitioners. Importantly, environmental soil factors often provide strong signals on their own, and the best predictive power typically arises from integrating microbiome with environmental data, which informs cost-effective data collection strategies for soil health and crop outcome prediction.

Abstract

The preservation of soil health is a critical challenge in the 21st century due to its significant impact on agriculture, human health, and biodiversity. We provide the first deep investigation of the predictive potential of machine learning models to understand the connections between soil and biological phenotypes. We investigate an integrative framework performing accurate machine learning-based prediction of plant phenotypes from biological, chemical, and physical properties of the soil via two models: random forest and Bayesian neural network. We show that prediction is improved when incorporating environmental features like soil physicochemical properties and microbial population density into the models, in addition to the microbiome information. Exploring various data preprocessing strategies confirms the significant impact of human decisions on predictive performance. We show that the naive total sum scaling normalization that is commonly used in microbiome research is not the optimal strategy to maximize predictive power. We also find that accurately defined labels are more important than normalization, taxonomic level, or model characteristics. In cases where humans are unable to classify samples accurately, machine learning model performance is limited. Lastly, we provide domain scientists with a full model selection decision tree to identify the human choices that optimize model prediction power. Our work is accompanied by open-source reproducible scripts (https://github.com/solislemuslab/soil-microbiome-nn) for maximum outreach among the microbiome research community.

Paper Structure (7 sections, 1 equation, 47 figures, 8 tables)

Figures (47)

  • Figure 1: Workflow of the analyses with three main steps: (1) Data Preparation, (2) Feature Selection, and (3) Classification. In (1) Data Preparation, we consider OTUs (number of OTUs = N) at different taxonomic levels and filter by sample size (n). In addition, we apply every combination of five normalization methods and four zero replacement methods (for a total of 20 normalized datasets). In (2) Feature Selection, we rank OTUs based on i) the number of times they are selected as important features by machine learning (ML) criteria, and ii) the greatest degree of difference on microbial networks reconstructed from samples of each class. We score OTUs based on whether they are selected by ML ($score=1$), by network comparison ($score=2$), both ($score=3$) or neither ($score=0$). In (3) Classification, the Venn diagram depicts the different types of predictors: microbiome (OTUs), environmental (Env), and the combination of both. The acronyms (e.g., All-OTU or OTU-S3+DS) correspond to different choices of predictors that are described in Table \ref{tableNameMethods}. Random forest and Bayesian neural network classification models are fitted on the different input predictors.
  • Figure 1: Flowchart for binarizing the continuous yield response into binary labels for the Russet variety. The same procedure is used for every variety in the dataset.
  • Figure 2: The most accurate predictions across all outcomes are achieved using alpha diversity and soil chemistry data (Alpha+Soil) for the RF model, whereas for the Bayesian NN models, optimal performance is observed when utilizing OTUs identified as important by both machine learning and network comparison strategies, in conjunction with soil chemistry data (OTU-S3+Soil). For a detailed presentation of all results, please refer to Supplementary Figures S\ref{4-ALL-RF} (RF) and S\ref{5-ALL-BNN} (Bayesian NN).
  • Figure 2: Flowchart of the data augmentation algorithm. The target sample size for the training set is 800, with 400 samples for each label. The artificially generated noise must be variety-specific before it is added to the original samples, so that the biological implications of the original samples are preserved.
  • Figure 3: Weighted F1 scores (y-axis) for random forest and Bayesian neural network (Bayesian NN) models for the pitted scab disease under the 20 normalization/zero replacement strategies (x-axis). The lack of a consistent pattern prevents us from recommending an optimal strategy for microbiome data. We can conclude, however, that taxonomic level, normalization, and zero replacement strategy affect the prediction accuracy of the models, as evidenced by the broad range of the points.
  • ...and 42 more figures
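
The OTU scoring rule stated in the workflow caption (Figure 1) can be illustrated with a short sketch. This is not the authors' implementation: the function name `score_otus`, the OTU identifiers, and the two selection sets are hypothetical, and the reading of "S3" as "score 3" is an assumption based on the Figure 2 caption, which describes OTU-S3 as OTUs identified by both strategies. The sketch encodes only the rule itself: 1 if selected by ML, 2 if selected by network comparison, 3 if both, 0 if neither.

```python
def score_otus(all_otus, ml_selected, network_selected):
    """Return {otu: score} following the rule in the Figure 1 caption.

    All argument names are illustrative; the inputs are plain collections
    of OTU identifiers.
    """
    ml_selected = set(ml_selected)
    network_selected = set(network_selected)
    scores = {}
    for otu in all_otus:
        # 1 point for ML selection plus 2 points for network selection,
        # so "both" gives 3 and "neither" gives 0.
        scores[otu] = int(otu in ml_selected) + 2 * int(otu in network_selected)
    return scores


# Hypothetical OTU identifiers, for illustration only.
otus = ["otu_a", "otu_b", "otu_c", "otu_d"]
scores = score_otus(
    otus,
    ml_selected={"otu_a", "otu_c"},
    network_selected={"otu_b", "otu_c"},
)

# OTUs with score 3 (selected by both strategies).
s3 = [otu for otu, s in scores.items() if s == 3]
```

Because the two selection criteria contribute 1 and 2 respectively, the four possible outcomes map to four distinct scores, so a single integer records which criteria chose each OTU.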