Tackling water table depth modeling via machine learning: From proxy observations to verifiability
Joseph Janssen, Ardalan Tootchi, Ali A. Ameli
TL;DR
This paper tackles large-scale, static water table depth (WTD) estimation by combining physically constrained machine learning (ML) with proxy observations to create three 500 m WTD maps for the USA and Canada. It compares three ML setups, V1 (real WTD only), V2 (real plus shoreline proxy WOP>75%), and V3 (adds HAND-derived proxies), against two physically-based (PB) simulations, evaluating performance across ten North American ecoregions using unseen real and proxy data. Results show that the ML models generally outperform the PB simulations in correlation with observed WTD (Corr-OBS in the range $0.6$–$0.75$). In particular, V2 excels at predicting interior wet areas, while V3 captures mountainous variability by leveraging topographic controls such as the Topographic Index. The study highlights pervasive data biases and uncertainties in WTD observations and the risk of model equifinality, and it points to future directions: integrating physical laws, strengthening verification standards, and developing richer proxy data to improve the verifiability and realism of large-scale WTD predictions.
Abstract
Spatial patterns of water table depth (WTD) play a crucial role in shaping ecological resilience, hydrological connectivity, and human-centric systems. Generally, a large-scale (e.g., continental or global) continuous map of static WTD can be simulated using either physically-based (PB) or machine learning-based (ML) models. We construct three fine-resolution (500 m) ML simulations of WTD across the United States and Canada, using the XGBoost algorithm and more than 20 million real and proxy observations of WTD. The three ML models were constrained using known physical relations between WTD and its drivers, and were trained by sequentially adding real and proxy observations of WTD. Through an extensive (pixel-by-pixel) evaluation across the study region and within ten major ecoregions of North America, we demonstrate that our models (corr = 0.6–0.75) predict unseen real and proxy observations of WTD more accurately than two available PB simulations of WTD (corr = 0.21–0.40). However, we still argue that currently available large-scale simulations of static WTD could be uncertain in data-scarce regions such as steep mountainous terrain. We reason that biased observational data, mainly collected from low-elevation floodplains, and the over-flexibility of available models can negatively affect the verifiability of large-scale simulations of WTD. Finally, we discuss future directions that may help hydrogeologists decide how to improve machine learning-based WTD estimations. In particular, we advocate for the use of proxy satellite data, the incorporation of physical laws, the implementation of better model verification standards, the development of novel globally available emergent indices, and the collection of more reliable observations.
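The pixel-by-pixel evaluation described above, overall and within each ecoregion, can be illustrated with a minimal sketch. It assumes that "corr" denotes the Pearson correlation between observed and simulated WTD, which the abstract does not state explicitly. The function names, the `"all"` key, the ecoregion labels, and the toy values are all hypothetical and are not taken from the study.

```python
import math
from collections import defaultdict


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (NaN if undefined)."""
    n = len(xs)
    if n < 2:
        return math.nan
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return math.nan  # a constant series has no defined correlation
    return sxy / math.sqrt(sxx * syy)


def corr_by_ecoregion(observed, simulated, ecoregion):
    """Correlate held-out observations with simulated WTD, per ecoregion and overall.

    Each index is one evaluation pixel: observed[i] is the unseen observation,
    simulated[i] the model value at that pixel, ecoregion[i] its region label.
    """
    groups = defaultdict(lambda: ([], []))
    for obs_value, sim_value, label in zip(observed, simulated, ecoregion):
        groups[label][0].append(obs_value)
        groups[label][1].append(sim_value)
    result = {label: pearson(o, s) for label, (o, s) in groups.items()}
    result["all"] = pearson(list(observed), list(simulated))  # hypothetical key
    return result


# Toy values in metres, invented purely for illustration.
obs = [1.0, 2.0, 3.0, 10.0, 20.0, 30.0]
sim = [2.0, 4.0, 6.0, 30.0, 20.0, 10.0]
eco = ["plains"] * 3 + ["mountains"] * 3

scores = corr_by_ecoregion(obs, sim, eco)
for label, value in scores.items():
    print(f"{label}: {value:.2f}")
```

Reporting the score per region alongside the pooled score matters here because a pooled correlation can be dominated by well-sampled lowland pixels and hide poor agreement in data-scarce regions, which is the concern the abstract raises for steep mountainous terrain.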
