Learning Bayesian Networks with Heterogeneous Agronomic Data Sets via Mixed-Effect Models and Hierarchical Clustering

Lorenzo Valleggi, Marco Scutari, Federico Mattia Stefanini

TL;DR

This work tackles heterogeneity and causal inference in agronomic data by introducing a mixed-effects Bayesian network (BN) that embeds random effects into its local distributions and uses hierarchical clustering to handle the diversity of site-variety combinations. The authors develop a structure-learning procedure that combines residual-based clustering (60 clusters) with a linear mixed-effects formulation, yielding a BN that outperforms a baseline conditional Gaussian BN on grain yield prediction and imputation. The learned network reveals new causal connections (e.g., involving plant height, silking, and temperature/humidity windows) and reduces the mean absolute percentage error (MAPE) of grain yield prediction from around 28% to approximately 17%, supporting the BN as a practical decision-support tool for hierarchical agronomic data. The approach addresses hierarchical exchangeability and provides a framework for causal reasoning in complex, heterogeneous data sets; spatial modeling and more granular clustering are left as possible extensions.
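To make the idea of residual-based clustering more concrete, here is a minimal sketch in plain Python. It is not the authors' procedure: the function names, the use of per-group mean residuals, and the centroid-style merge rule are all assumptions chosen for illustration, and the toy data are invented. The sketch groups site-variety combinations whose residuals from some baseline model look alike, so that the resulting cluster label could serve as the grouping factor for random effects.

```python
from itertools import combinations
from statistics import mean


def group_mean_residuals(records):
    """Average the residuals of each group (hypothetical helper).

    records: iterable of (group_label, residual) pairs.
    """
    by_group = {}
    for group, residual in records:
        by_group.setdefault(group, []).append(residual)
    return {group: mean(values) for group, values in by_group.items()}


def agglomerate(group_means, n_clusters):
    """Bottom-up clustering of groups by their mean residual.

    Assumption: clusters are compared by the mean of their members'
    mean residuals; the paper may use another distance or linkage.
    """
    clusters = [
        {"members": [g], "values": [m]} for g, m in sorted(group_means.items())
    ]
    while len(clusters) > n_clusters:
        # Find the pair of clusters whose centroids are closest.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: abs(
                mean(clusters[ij[0]]["values"]) - mean(clusters[ij[1]]["values"])
            ),
        )
        merged = {
            "members": clusters[i]["members"] + clusters[j]["members"],
            "values": clusters[i]["values"] + clusters[j]["values"],
        }
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return [sorted(c["members"]) for c in clusters]


# Toy residuals for four invented site-variety combinations.
records = [
    ("siteA-var1", -1.0), ("siteA-var1", -1.2), ("siteA-var2", -0.9),
    ("siteB-var1", 2.0), ("siteB-var1", 2.2), ("siteB-var2", 2.1),
]
clusters = agglomerate(group_mean_residuals(records), n_clusters=2)
print(sorted(clusters))
```

In this toy case the two site A combinations end up in one cluster and the two site B combinations in the other. With real data the number of clusters (60 in the summary above) would be a modelling choice rather than something this sketch determines.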

Abstract

Maize, a crucial crop globally cultivated across vast regions, especially in sub-Saharan Africa, Asia, and Latin America, occupies 197 million hectares as of 2021. Various statistical and machine learning models, including mixed-effect models, random coefficients models, random forests, and deep learning architectures, have been devised to predict maize yield. These models consider factors such as genotype, environment, genotype-environment interaction, and field management. However, the existing models often fall short of fully exploiting the complex network of causal relationships among these factors and the hierarchical structure inherent in agronomic data. This study introduces an innovative approach integrating random effects into Bayesian networks (BNs), leveraging their capacity to model causal and probabilistic relationships through directed acyclic graphs. Rooted in the linear mixed-effects models framework and tailored for hierarchical data, this novel approach demonstrates enhanced BN learning. Application to a real-world agronomic trial produces a model with improved interpretability, unveiling new causal connections. Notably, the proposed method significantly reduces the error rate in maize yield prediction from 28% to 17%. These results advocate for the preference of BNs in constructing practical decision support tools for hierarchical agronomic data, facilitating causal inference.

Paper Structure

This paper contains 11 sections, 5 equations, 3 figures, 5 tables, 1 algorithm.

Figures (3)

  • Figure 1: Prediction accuracy of the learned BNs, $\mathcal{B}_{\mathit{LME}}$ (blue line) and $\mathcal{B}_{\mathit{CGBN}}$ (orange line), in terms of grain yield Mean Absolute Percentage Error (MAPE) of each scenario of evidence propagation (definitions of the scenarios are reported in Table \ref{tab:scenarios}). Lower values are better.
  • Figure 2: Imputation accuracy of the learned BNs, $\mathcal{B}_{\mathit{LME}}$ (blue points) and $\mathcal{B}_{\mathit{CGBN}}$ (red points), in terms of grain yield Mean Absolute Percentage Error (MAPE) of each site-variety combination, shown sequentially for brevity. Lower values are better.
  • Figure 3: Kernel densities of the grain yield in the training set are represented by the solid curve, while the dashed curve depicts the kernel densities of the predicted grain yield obtained through likelihood-weighted approximation during cross-validation. The kernel density-based credible interval at 80% for the grain yield in the training set is indicated by the red line and for the predicted grain yield by the blue line. The mean is reported with a solid line for the grain yield of the training set and a dashed line for the predicted grain yield.
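
The captions above report accuracy as MAPE. The paper's own formula is not reproduced in this excerpt, so the following sketch assumes the standard definition, the mean of $|y_i - \hat{y}_i| / |y_i|$ expressed as a percentage. The function name and the example yields are illustrative only.

```python
def mape(observed, predicted):
    """Mean Absolute Percentage Error, in percent (standard definition assumed)."""
    if len(observed) != len(predicted):
        raise ValueError("observed and predicted must have the same length")
    if not observed:
        raise ValueError("at least one observation is required")
    if any(o == 0 for o in observed):
        raise ValueError("MAPE is undefined when an observed value is zero")
    total = sum(abs((o - p) / o) for o, p in zip(observed, predicted))
    return 100.0 * total / len(observed)


# Invented grain yields: relative errors are 20% and 25%, so MAPE is about 22.5.
print(mape([10.0, 20.0], [8.0, 25.0]))
```

Because each error is scaled by the observed value, MAPE is comparable across site-variety combinations with very different yield levels, which is presumably why it suits the per-combination comparison in Figure 2.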