
Feasibility of machine learning-based rice yield prediction in India at the district level using climate reanalysis data

Djavan De Clercq, Adam Mahdi

TL;DR

Rice yields can be predicted with reasonable accuracy: the best out-of-sample R², MAE, and MAPE are 0.82, 0.29, and 0.16, respectively.

Abstract

Yield forecasting, the science of predicting agricultural productivity before the crop harvest occurs, helps a wide range of stakeholders make better decisions around agricultural planning. This study investigates whether machine learning-based yield prediction models can predict Kharif season rice yields at the district level in India several months before the rice harvest takes place. The methodology involved training 19 machine learning models, such as CatBoost, LightGBM, Orthogonal Matching Pursuit, and Extremely Randomized Trees, on 20 years of climate, satellite, and rice yield data across 247 of India's rice-producing districts. In addition to model-building, a dynamic dashboard was built to understand how the reliability of rice yield predictions varies across districts. The results of the proof-of-concept machine learning pipeline demonstrated that rice yields can be predicted with a reasonable degree of accuracy, with out-of-sample R², MAE, and MAPE performance of up to 0.82, 0.29, and 0.16, respectively. These results outperformed test set performance reported in related literature on rice yield modeling in other contexts and countries. In addition, SHAP value analysis was conducted to infer both the importance and directional impact of the climate and remote sensing variables included in the model. Important features driving rice yields included temperature, soil water volume, and leaf area index. In particular, higher temperatures in August correlate with increased rice yields, especially when the leaf area index in August is also high. Building on the results, a proof-of-concept dashboard was developed to allow users to easily explore which districts may experience a rise or fall in yield relative to the previous year.
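
The three reported metrics can be made concrete with a minimal sketch. It is not the authors' pipeline: the function names and the toy yield values are illustrative assumptions, and MAPE is assumed to be expressed as a fraction (so 0.16 would correspond to 16%). MAE is in the same units as the yield data, which this summary does not state.

```python
from statistics import mean


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_bar = mean(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - y_bar) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def mae(y_true, y_pred):
    """Mean absolute error, in the units of the target."""
    return mean(abs(t - p) for t, p in zip(y_true, y_pred))


def mape(y_true, y_pred):
    """Mean absolute percentage error as a fraction (requires non-zero targets)."""
    return mean(abs((t - p) / t) for t, p in zip(y_true, y_pred))


# Hypothetical observed and predicted yields for three district-years.
observed = [2.0, 3.0, 4.0]
predicted = [2.5, 3.0, 3.5]

print(r2_score(observed, predicted), mae(observed, predicted), mape(observed, predicted))
```

On these toy values R² is 0.75 and MAPE is 0.125. Note that higher is better for R², while lower is better for MAE and MAPE.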

Paper Structure (17 sections, 1 equation, 5 figures, 2 tables)


Figures (5)

  • Figure 1: Overview of the geospatial data used in this research. The panels show, respectively: cultivated rice area in India; snapshots of temperature, total precipitation, potential evaporation, surface pressure, soil moisture, and leaf area index from the ECMWF (average values of January 2022); NDVI data from NASA’s MODIS (average values of January 2022); and district-level rice yield data from India’s Ministry of Agriculture and Farmers Welfare (in 2020).
  • Figure 2: Comparative evaluation of selected models, including Random Forest, CatBoost, and LightGBM regressors, using prediction error and residuals analysis. Two years of observations (502 observations in total) were used as the out-of-sample validation data, on which the Random Forest, CatBoost, and LightGBM models have test R² values of 0.80, 0.82, and 0.80, respectively. Residuals are mostly centered around zero, but CatBoost shows skew in its error distribution. The histogram of residuals indicates that Random Forest and CatBoost have a tighter error distribution than LightGBM, whose residuals span a broader range.
  • Figure 3: Interpretation of SHAP values for selected features in the rice yield prediction random forest model. The SHAP feature importance plot (left panel) exhibits the impact of various features on the model's output. Larger absolute SHAP values indicate a greater influence on the predicted yield. The coloring on the feature importance plot represents the value of the feature for each data point. Blue points indicate low feature values, while pink points represent high feature values. This color gradient allows us to visualize not only the impact (magnitude of the SHAP value) each feature has on the model output but also the distribution of the feature's values. For instance, when examining 'Temperature August', we can see a mix of pink and blue points across a range of SHAP values, indicating a diverse range of temperatures in August within the dataset and how these varying temperatures correlate with the rice yield prediction. The top right panel presents a SHAP dependence plot for temperature in August, illustrating a correlation between higher temperatures and increased SHAP values for rice yield. The intensity of the color indicates the interaction effect, with a notable interaction with LAI in August, as higher LAI values (depicted in red) intensify the impact of temperature on yield. The bottom right panel depicts a SHAP dependence plot for soil water volume (SWVL1) in August, showing the relationship between SWVL1 values and SHAP values. This plot reveals that certain values of SWVL1 are associated with lower or higher SHAP values, indicating its varying influence on yield predictions, with the color intensity representing the interaction with NDVI in May.
  • Figure 4: Interactive dashboard of yield prediction model outputs. The top panel shows a map of India with a colour coding applied to different districts to indicate the predicted yield values. Shades of blue indicate an increase in yield and shades of red denote a decrease in yield compared to the prior year's yield. Accompanying the map is a bar chart that provides a state-level summary and a table that enumerates the district-level predicted yields and the percentage change from the previous year across all states and districts. The bottom panel provides a similar comparative yield prediction, but focuses on the state of Uttar Pradesh.
  • Figure 5: Interactive dashboard showing a spatial view of yield prediction model error. The top panel provides model diagnostic information: a map with the average percentage error by region; a scatter plot comparing the predicted yield and the actual yield; and a line graph showing the actual yield and predicted yield each year. The lower panel shows a similar view, zoomed in to districts within the state of Rajasthan.
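
Two calculations implied by the captions above can be sketched briefly: holding out two years of observations as out-of-sample data (Figure 2) and computing the percentage change in predicted yield relative to the previous year (Figure 4). This is a hedged illustration rather than the authors' code. The record layout, the function names `split_by_year` and `pct_change`, and the toy values are hypothetical, and the sketch assumes the held-out years are the two most recent ones, which the caption does not specify.

```python
def split_by_year(rows, n_holdout_years=2):
    """Split records so the most recent n years form the out-of-sample set.

    Assumption: the hold-out years are the latest ones in the data.
    """
    years = sorted({r["year"] for r in rows})
    holdout = set(years[-n_holdout_years:])
    train = [r for r in rows if r["year"] not in holdout]
    held_out = [r for r in rows if r["year"] in holdout]
    return train, held_out


def pct_change(predicted, previous):
    """Percentage change of a predicted yield relative to the prior year's yield."""
    return 100.0 * (predicted - previous) / previous


# Hypothetical district-year records (two districts, four years).
rows = [
    {"district": d, "year": y, "yield": v}
    for d, y, v in [
        ("A", 2017, 2.0), ("A", 2018, 2.1), ("A", 2019, 2.3), ("A", 2020, 2.2),
        ("B", 2017, 3.0), ("B", 2018, 3.2), ("B", 2019, 3.1), ("B", 2020, 3.4),
    ]
]

train, held_out = split_by_year(rows)
print(len(train), len(held_out))
print(pct_change(predicted=2.2, previous=2.0))
```

Splitting by year rather than at random keeps every held-out observation later in time than the training data, which matches the forecasting use case of predicting a season that has not yet been harvested.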