Knowledge-guided machine learning for county-level corn yield prediction under drought
Xiaoyu Wang, Yijia Xu, Jingyi Huang, Zhengwei Yang, Yanbo Huang, Rajat Bindlish, Zhou Zhang
TL;DR
KGML-SM addresses county-level corn yield prediction under drought by embedding soil moisture as an explicit intermediate variable and combining process-based simulations with ML via a Weather-to-Soil (W2S) encoder and an attention-based predictor. It introduces a drought-aware loss that weights errors by soil moisture and mitigates overestimation in water-limited conditions, enhancing robustness. The W2S encoder, based on a U-Net, maps weather time series to soil moisture, while attention-based fusion weighs heterogeneous features to predict yield. Pretraining on APSIM field-level simulations and fine-tuning on county-level data from Google Earth Engine produce robust, interpretable county-scale predictions, with diagnostics linking soil moisture dynamics to yield errors.
Abstract
Remote sensing (RS), which enables non-contact acquisition of extensive ground observations, is a valuable tool for crop yield prediction. Traditional process-based models struggle to incorporate large volumes of RS data, and most users lack an understanding of crop growth mechanisms. In contrast, machine learning (ML) models are often criticized as "black boxes" due to their limited interpretability. To address these limitations, we utilized Knowledge-Guided Machine Learning (KGML), a framework that leverages the strengths of both process-based and ML models. Existing works have either overlooked the role of soil moisture in corn growth or failed to embed this effect into their models. To bridge this gap, we developed the Knowledge-Guided Machine Learning with Soil Moisture (KGML-SM) framework, which treats soil moisture as an intermediate variable in corn growth to emphasize its key role in plant development. Additionally, based on the prior knowledge that the model may overestimate yield under drought conditions, we designed a drought-aware loss function that penalized predicted yield in drought-affected areas. Our experiments showed that the KGML-SM model outperformed traditional ML models. We explored the relationships between drought, soil moisture, and corn yield prediction by assessing the importance of different features within the model and analyzing how soil moisture impacts predictions across different regions and time periods. Finally, we provided interpretability for prediction errors to guide future model optimization.
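
The abstract does not give the form of the drought-aware loss, so the following is only an illustrative sketch of one way such a loss could look, not the authors' formulation. It assumes a mean squared error base term plus an extra penalty on overestimates that grows as soil moisture falls below a threshold. The function name `drought_aware_loss` and the parameters `sm_threshold` and `penalty_weight` are hypothetical.

```python
def drought_aware_loss(pred, target, soil_moisture,
                       sm_threshold=0.2, penalty_weight=1.0):
    """Illustrative loss: MSE plus a penalty on overestimates in dry samples.

    All names and default values here are assumptions for illustration;
    the paper's actual loss may differ.
    """
    if not pred or not (len(pred) == len(target) == len(soil_moisture)):
        raise ValueError("inputs must be non-empty and of equal length")

    n = len(pred)
    # Base term: ordinary mean squared error over all samples.
    mse = sum((p - t) ** 2 for p, t in zip(pred, target)) / n

    penalty = 0.0
    for p, t, sm in zip(pred, target, soil_moisture):
        # Dryness is 0 at or above the threshold and rises to 1 at sm = 0.
        dryness = max(0.0, sm_threshold - sm) / sm_threshold
        # Only overestimates (pred > target) are penalized.
        overestimate = max(0.0, p - t)
        penalty += dryness * overestimate ** 2

    return mse + penalty_weight * penalty / n


# Usage: the same 2-unit overestimate costs more for the dry sample.
wet = drought_aware_loss([10.0], [8.0], [0.3])
dry = drought_aware_loss([10.0], [8.0], [0.1])
```

Under these assumptions the loss is asymmetric only where soil moisture is low: an underestimate in a dry county, or any error in a wet county, is scored as plain MSE.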
