Has the Deep Neural Network learned the Stochastic Process? An Evaluation Viewpoint
Harshit Kumar, Beomseok Kang, Biswadeep Chakraborty, Saibal Mukhopadhyay
TL;DR
This work reframes the evaluation of DNNs that forecast stochastic complex systems. It introduces Fidelity to Stochastic Process (F2SP) and Statistic-GT as evaluation targets that reflect the system's underlying stochastic dynamics rather than a single observed realization. It shows formally that Expected Calibration Error (ECE) satisfies the necessary condition for testing F2SP using only the Observed-GT, and it formalizes a stochastic-process framework with micro and macro random variables (RVs). Experiments on synthetic wildfire, host-pathogen, and stock-market simulations show that calibration-based measures reveal whether the stochastic process has been learned where traditional metrics fail. A real-world wildfire case corroborates the synthetic findings and highlights the practical value of integrating the framework to resolve rank conflicts between metrics. The work advocates a dual-evaluation paradigm (F2SP via ECE and F2R via discriminative metrics) to reliably assess DNNs on stochastic, high-dimensional forecasting tasks with real-world impact.
Abstract
This paper presents the first systematic study of evaluating Deep Neural Networks (DNNs) designed to forecast the evolution of stochastic complex systems. We show that traditional evaluation methods, such as threshold-based classification metrics and error-based scoring rules, assess a DNN's ability to replicate the observed ground truth but fail to measure whether the DNN has learned the underlying stochastic process. To address this gap, we propose a new evaluation criterion called Fidelity to Stochastic Process (F2SP), which represents the DNN's ability to predict the system property Statistic-GT (the ground truth of the stochastic process), and introduce an evaluation metric that exclusively assesses F2SP. We formalize F2SP within a stochastic framework and establish criteria for validly measuring it. We formally show that Expected Calibration Error (ECE), unlike traditional evaluation methods, satisfies the necessary condition for testing F2SP. Empirical experiments on synthetic datasets, including wildfire, host-pathogen, and stock market models, demonstrate that ECE uniquely captures F2SP. We further extend our study to real-world wildfire data, highlighting the limitations of conventional evaluation and discussing the practical utility of incorporating F2SP into model assessment. This work offers a new perspective on evaluating DNNs that model complex systems by emphasizing the importance of capturing the underlying stochastic process.
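To make the contrast between the two kinds of evaluation concrete, the following is a minimal sketch and not the paper's implementation. It assumes a binary outcome per cell, an equal-width binned ECE estimator, and a hypothetical per-cell event probability `true_p` standing in for the Statistic-GT; the function name, bin count, and toy data are all illustrative assumptions. Two predictors are scored against one sampled realization (the Observed-GT): one outputs the true probabilities, the other outputs the same decisions hardened to 0 or 1.

```python
import numpy as np

def expected_calibration_error(probs, outcomes, n_bins=10):
    """Binned ECE for binary events (illustrative estimator, equal-width bins)."""
    probs = np.asarray(probs, dtype=float).ravel()
    outcomes = np.asarray(outcomes, dtype=float).ravel()
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Interior edges map each probability to a bin index in 0..n_bins-1.
    bin_ids = np.digitize(probs, edges[1:-1])
    ece = 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if not mask.any():
            continue
        # Gap between observed event frequency and mean predicted probability.
        gap = abs(outcomes[mask].mean() - probs[mask].mean())
        ece += mask.mean() * gap  # weight by the fraction of samples in the bin
    return ece

rng = np.random.default_rng(0)

# Hypothetical Statistic-GT: the true per-cell event probability (assumption).
true_p = rng.uniform(0.05, 0.95, size=20000)
# One realization of the stochastic process: the Observed-GT.
observed = (rng.uniform(size=true_p.size) < true_p).astype(float)

faithful = true_p                               # predicts the true probabilities
overconfident = (true_p > 0.5).astype(float)    # same decisions, hardened to 0/1

# Thresholded accuracy cannot tell the two predictors apart.
acc_faithful = np.mean((faithful > 0.5) == (observed == 1.0))
acc_overconfident = np.mean((overconfident > 0.5) == (observed == 1.0))

# ECE separates them using only the observed realization.
ece_faithful = expected_calibration_error(faithful, observed)
ece_overconfident = expected_calibration_error(overconfident, observed)

print(f"accuracy: faithful={acc_faithful:.3f}, overconfident={acc_overconfident:.3f}")
print(f"ECE:      faithful={ece_faithful:.3f}, overconfident={ece_overconfident:.3f}")
```

In this toy setup both predictors make identical thresholded decisions, so any threshold-based classification metric scores them the same, while the ECE of the overconfident predictor is much larger. The example only illustrates the intuition behind the claim; it does not reproduce the paper's formal argument or its experiments.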
