Identifying Trustworthiness Challenges in Deep Learning Models for Continental-Scale Water Quality Prediction

Xiaobo Xia, Xiaofeng Liu, Jiale Liu, Kuai Fang, Lu Lu, Samet Oymak, William S. Currie, Tongliang Liu

TL;DR

This work presents a multi-dimensional, quantitative evaluation of trustworthiness that benchmarks three state-of-the-art deep learning architectures: recurrent (LSTM), operator-learning (DeepONet), and transformer-based (Informer), all trained on 37 years of data from 482 U.S. basins to predict 20 water quality variables.

Abstract

Water quality is foundational to environmental sustainability, ecosystem resilience, and public health. Deep learning offers transformative potential for large-scale water quality prediction and for generating scientific insight. However, the widespread adoption of deep learning models in high-stakes operational decision-making, such as pollution mitigation and equitable resource allocation, is prevented by unresolved trustworthiness challenges, including performance disparity, robustness, uncertainty, interpretability, generalizability, and reproducibility. In this work, we present a multi-dimensional, quantitative evaluation of trustworthiness that benchmarks three state-of-the-art deep learning architectures: recurrent (LSTM), operator-learning (DeepONet), and transformer-based (Informer), all trained on 37 years of data from 482 U.S. basins to predict 20 water quality variables. Our investigation reveals systematic performance disparities tied to process complexity, data availability, and basin heterogeneity. Management-critical variables remain the least predictable and most uncertain. Robustness tests reveal pronounced sensitivity to outliers and corrupted targets; notably, the architecture with the strongest baseline performance (LSTM) proves most vulnerable under data corruption. Attribution analyses align for simple variables but diverge for nutrients, underscoring the need for multi-method interpretability. Spatial generalization to ungauged basins remains poor across all models. This work serves as a timely call to action for advancing trustworthy data-driven methods for water resources management, and it offers critical insights for researchers, decision-makers, and practitioners seeking to leverage artificial intelligence (AI) responsibly in environmental management.

Paper Structure

This paper contains 25 sections, 1 equation, 25 figures, 2 tables.

Figures (25)

  • Figure 1: Performance of multi-task deep learning models for water quality prediction across the continental United States (CONUS). (A) Boxplot of Kling-Gupta Efficiency (KGE) for the testing periods (1985, 1990, 1995, 2000, 2005, 2010, and 2015) across 20 predicted water quality variables associated with physical/chemical properties, geochemical weathering processes, and nutrient cycling. Boxes show the median (central line), the interquartile range (IQR; Q1-Q3), and whiskers extending to $\text{Q1}-1.5\times\text{IQR}$ and $\text{Q3} + 1.5\times\text{IQR}$. Wilcoxon signed-rank p-values ($^{***}p < 0.001$, $^{**}p < 0.01$, $^{*}p < 0.05$, and "ns" $p\geq 0.05$) were adjusted using the Benjamini-Hochberg false discovery rate (FDR) procedure. (B) Locations of the 482 riverine monitoring sites used in this study. (C) Example time series of total phosphorus (TP) showing model predictions, training/testing samples, and additional daily observations (not used in training or testing) collected by the National Center for Water Quality Research (NCWQR) at the Maumee River in Waterville, OH (orange circle in panel (B)) during 2008. (D-F) Scatter plots comparing predicted TP concentrations from the three deep learning models with independent NCWQR observations, with PBIAS indicating percentage bias (observation minus simulation).
  • Figure 2: Relationships between model performance, simplicity, and training sample size across 20 water quality variables for LSTM (A), DeepONet (B), and Informer (C), and basin-level relations for LSTM (D). In panels (A-B), each dot represents one variable. Model performance is represented by the median KGE across CONUS, while the simplicity index measures the proportion of variance in water quality explained by linear relationships with runoff and annual cycles [fang2024modeling]. Both the size and color of each dot indicate the number of training samples, with larger, yellow
  • Figure 3: Robustness of three deep learning models under dataset corruptions. (A-I) Scatterplots of the median percent change in KGE relative to the uncorrupted baseline for each model (columns) and data corruption type (rows). Blue dots represent corruptions applied to input features (x) and red dots represent corruptions applied to targets (y). The fitted line shows the Pearson correlation between baseline median KGE and percent change (shaded 95% CI), reflecting how model vulnerability relates to baseline performance. (J-L) Aggregate robustness curves plotting the median percent change in KGE (across all basins and variables) versus the proportion of the dataset corrupted. The Theil-Sen median station-level slope $\beta$ quantifies the rate of model performance degradation and is interpreted as the expected percent change in KGE per 0.1 (10% of the dataset) increase in corruption.
  • Figure 4: Model prediction uncertainty across water quality variables and its relationship with baseline performance, variable simplicity, and linearity. (A) For each water quality variable and deep learning model, prediction uncertainty is quantified as the standard deviation (SD) of the Kling-Gupta Efficiency (KGE) over 50 test-time augmentation (TTA) runs (see Methods). Boxplots show the median (central line), interquartile range (IQR, represented by the boxes spanning the first (Q1) to the third quartile (Q3)), and whiskers extending to $\text{Q1}-1.5\times\text{IQR}$ and $\text{Q3} + 1.5\times\text{IQR}$. (B-D) For each model (column), per-variable median uncertainty across all basins versus the baseline median KGE. (E-G) As in (B-D), but versus per-variable median simplicity. (H-J) As in (B-D), but versus per-variable median linearity. In (B-J), dashed lines are least-squares fits with 95% CI; Pearson’s r and corresponding p-values are reported in each panel.
  • Figure 5: Group-level feature importance across water quality variables, deep learning models, and attribution methods. In each panel, the top and bottom arcs list the 20 water quality variables and the feature groups: meteorological forcings (M), runoff (Q), rainfall chemistry (RC), vegetation indices (V), and basin attributes (BA) (full group composition in Methods). The ribbon width is normalized for each variable so that the widths across all groups sum to 1, representing the variable’s fractional attribution (comparable across groups for a given variable, but not across different variables). For Ablation (A, D, G), group importance is the percent decrease in Kling-Gupta Efficiency (KGE) when that group is removed from the full model. For Traverse (B, E, H), group importance is the average percent KGE reduction across all model variants with and without the target group (approximating its marginal contribution across subsets). For IG (Integrated Gradients, C, F, I), attributions are computed for each sample; the feature importance is the mean absolute IG over samples, and the group importance is the mean of feature-level $|$IG$|$ within that group.
  • ...and 20 more figures
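
The captions above report two skill metrics, KGE and PBIAS, without restating how they are computed. The following is a minimal sketch of both, assuming the commonly used form of the Kling-Gupta Efficiency (correlation, variability ratio, and bias ratio combined as a Euclidean distance from the ideal point) and the sign convention stated in the Figure 1 caption (observation minus simulation). The paper may use a different KGE variant; the function names `kge` and `pbias` and the sample values are hypothetical and are not taken from the paper.

```python
import math
from statistics import fmean, pstdev


def kge(obs, sim):
    """Kling-Gupta Efficiency, assuming the common r/alpha/beta formulation."""
    mu_o, mu_s = fmean(obs), fmean(sim)
    sd_o, sd_s = pstdev(obs), pstdev(sim)
    # Pearson correlation between observed and simulated series
    cov = fmean([(o - mu_o) * (s - mu_s) for o, s in zip(obs, sim)])
    r = cov / (sd_o * sd_s)
    alpha = sd_s / sd_o  # variability ratio (simulated / observed)
    beta = mu_s / mu_o   # bias ratio (simulated / observed)
    return 1.0 - math.sqrt((r - 1) ** 2 + (alpha - 1) ** 2 + (beta - 1) ** 2)


def pbias(obs, sim):
    """Percent bias with the sign convention observation minus simulation."""
    return 100.0 * sum(o - s for o, s in zip(obs, sim)) / sum(obs)


# Hypothetical concentrations (illustrative values only, not data from the paper)
obs = [0.10, 0.20, 0.40, 0.30]
sim_half = [0.5 * o for o in obs]  # a simulation that under-predicts by half

kge_half = kge(obs, sim_half)
pbias_half = pbias(obs, sim_half)
```

Under these assumptions a perfect simulation gives KGE = 1 and PBIAS = 0. A simulation that is exactly half of every observation keeps the correlation at 1 but has a variability ratio and a bias ratio of 0.5, so KGE = 1 - sqrt(0.5), roughly 0.29, and PBIAS = 50%. With this sign convention, a positive PBIAS indicates under-prediction.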