Has the Deep Neural Network learned the Stochastic Process? An Evaluation Viewpoint

Harshit Kumar; Beomseok Kang; Biswadeep Chakraborty; Saibal Mukhopadhyay

TL;DR

This work reframes the evaluation of DNNs that forecast stochastic complex systems by introducing Fidelity to Stochastic Process (F2SP) and the Statistic-GT, targets that reflect the system’s underlying stochastic dynamics rather than a single observed realization. It formalizes a stochastic-process framework with micro/macro random variables (RVs) and shows formally that Expected Calibration Error (ECE) satisfies the necessary condition for testing F2SP using only the Observed-GT. Experiments on synthetic forest-fire, host-pathogen, and stock-market simulations show that ECE reveals whether a DNN has learned the stochastic process where traditional metrics fail. A real-world wildfire case corroborates the synthetic findings and shows how the framework can be used in practice to resolve rank conflicts between metrics. The work advocates a dual-evaluation paradigm, with F2SP assessed via ECE and F2R (alignment with the Observed-GT) assessed via discriminative metrics, to reliably evaluate DNNs on stochastic, high-dimensional forecasting tasks with real-world impact.

Abstract

This paper presents the first systematic study of evaluating Deep Neural Networks (DNNs) designed to forecast the evolution of stochastic complex systems. We show that traditional evaluation methods, such as threshold-based classification metrics and error-based scoring rules, assess a DNN's ability to replicate the observed ground truth but fail to measure whether the DNN has learned the underlying stochastic process. To address this gap, we propose a new evaluation criterion called Fidelity to Stochastic Process (F2SP), representing the DNN's ability to predict the system property Statistic-GT (the ground truth of the stochastic process), and introduce an evaluation metric that exclusively assesses F2SP. We formalize F2SP within a stochastic framework and establish criteria for validly measuring it. We formally show that Expected Calibration Error (ECE) satisfies the necessary condition for testing F2SP, unlike traditional evaluation methods. Empirical experiments on synthetic datasets, including wildfire, host-pathogen, and stock market models, demonstrate that ECE uniquely captures F2SP. We further extend our study to real-world wildfire data, highlight the limitations of conventional evaluation, and discuss the practical utility of incorporating F2SP into model assessment. This work offers a new perspective on evaluating DNNs that model complex systems by emphasizing the importance of capturing the underlying stochastic process.
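
The claim that ECE can test F2SP using only the Observed-GT can be illustrated with a minimal sketch. It assumes a standard equal-width binned ECE for binary events, which may differ from the exact estimator used in the paper. The names `expected_calibration_error`, `statistic_gt`, `observed_gt`, `faithful`, and `overconfident` are hypothetical, and the data are synthetic draws rather than any of the paper's datasets.

```python
import random

def expected_calibration_error(probs, labels, n_bins=10):
    """Binned ECE: weighted mean of |observed frequency - mean predicted probability|."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes into the last bin
        bins[idx].append((p, y))
    n = len(probs)
    ece = 0.0
    for members in bins:
        if not members:
            continue  # empty bins contribute nothing
        mean_conf = sum(p for p, _ in members) / len(members)
        obs_freq = sum(y for _, y in members) / len(members)
        ece += (len(members) / n) * abs(obs_freq - mean_conf)
    return ece

rng = random.Random(0)
# Assumed Statistic-GT: per-cell probability of the target state (synthetic).
statistic_gt = [rng.random() for _ in range(20000)]
# A single realization (Observed-GT) sampled from those probabilities.
observed_gt = [1 if rng.random() < p else 0 for p in statistic_gt]

faithful = statistic_gt                                         # predicts the process itself
overconfident = [1.0 if p > 0.5 else 0.0 for p in statistic_gt]  # hard 0/1 predictions

ece_faithful = expected_calibration_error(faithful, observed_gt)
ece_overconfident = expected_calibration_error(overconfident, observed_gt)
```

In this toy setting, the predictor that outputs the process probabilities scores a much lower ECE than the hard 0/1 predictor, even though ECE is computed against one realization only.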

Paper Structure (50 sections, 6 equations, 25 figures, 8 tables)

Introduction
Background and Dataset
Formulation of DNN Prediction and Evaluation for Complex Systems
Stochasticity in Complex Systems and Challenges in Stochastic Modeling
Synthetic Complex Systems and Their Stochastic Simulation
Evaluating a Complex Stochastic Process
Simulating a Complex Stochastic Process
Formulating a Complex Stochastic Process
Why is Statistic-GT a Property of Interest: Limitations of F2R
How do we test Fidelity to Statistic-GT?
Expected Calibration Error Tests Fidelity to Statistic-GT
Baseline Evaluation Metrics
Benchmark Experiments
Experimental Setup
Testing ECE's ability to assess DNN's fidelity to Statistic-GT
...and 35 more sections

Figures (25)

Figure 1: (a) Evolution of a stochastic process in a forest fire model. Starting from the same initial conditions, diverse outcomes emerge over the prediction horizon, depicted by the shaded red region. The Observed-GT $\{b_{t,(i,j)}\}^{H \times W}$ represents one outcome on an $H \times W$ grid, while the Statistic-GT $\{p_{t,(i,j)}\}^{H \times W}$ shows the normalized frequency of target state occurrences, capturing the full stochastic process (§ \ref{sec:proposed}). (b) The proposed evaluation framework: F2R evaluates alignment with the Observed-GT (using AUC-PR), F2SP tests alignment with the Statistic-GT (using ECE), and MSE balances both criteria. The framework provides a unified approach for interpreting model performance in stochastic settings. See § \ref{app:practical_guide} for a practical guide to the framework.
Figure 2: (a) The first four rows display four distinct MC simulations of forest-fire evolution from the same initial condition. The last row shows the Statistic-GT representing the evolution of the Stochastic Process (formally defined in § \ref{ssec:micro_and_macro_rv}). For qualitative examples of MC simulations from other complex systems, see § \ref{app:other_complex_systems_description}. (b) The table lists the sources of randomness in the synthetic dataset across the simulation strategies used in this study. Deterministic processes randomize initial conditions but follow fixed fire evolution rules ($S\text{-Level} = 0$). Stochastic processes ($S\text{-Level} > 0$) allow multiple evolutionary paths: the ESP fixes initial conditions, while stochastic train/test setups (used in benchmark experiments, § \ref{sec:benchmark_experiments}) randomize initial conditions across simulations.
Figure 3: Performance of DNNs trained on one S-Level and tested on another, evaluated using three metrics: (a) $(1 - \text{AUC-PR})\downarrow$, (b) $\text{MSE}\downarrow$, and (c) $\text{ECE}\downarrow$ across three complex systems. Top Row: Forest Fire, Middle Row: Host-Pathogen, Bottom Row: Stock Market. The $x$-axis (Test S-Level) and $y$-axis (Train S-Level) are consistent across all matrices. This figure highlights ECE’s unique ability to evaluate whether the DNN has learned the correct stochastic process. While AUC-PR and MSE exhibit performance degradation as the difference between train and test S-Levels increases, ECE shows a distinct pattern. Specifically, where the train and test S-Levels match, ECE indicates little to no performance degradation, underscoring its unique capability to test the F2SP evaluation criterion. Theoretical insights into these results are detailed in § \ref{ssec:ece_for_evaluation} and § \ref{ssec:baseline_evaluation}.
Figure 4: Two DNNs were trained on 700 forest fire simulations with different S-Levels, 10 (orange, low stochasticity) and 20 (blue, high stochasticity), and evaluated on 300 test simulations with S-Level 20. Evaluation metrics include (a) AUC-PR, (b) MSE, and (c) ECE, measured over an extended prediction horizon. AUC-PR shows similar trends for both models, failing to distinguish the stochastic mismatch. MSE degrades more steeply in the mismatch case but also worsens for both models. In contrast, ECE remains low and stable for the DNN trained on S-Level 20, highlighting its unique ability to track alignment with the Statistic-GT. This behavior also suggests that long-horizon predictions can be more stable when tracking the Statistic-GT, since it represents a single underlying property of the system, unlike the Observed-GT, which varies across multiple possible outcomes.
Figure 5: The DNN takes as input a fire mask (left column) and 11 observational variables to predict the next-day fire mask (right column), which is compared to the Observed-GT (middle column). F2R metrics indicate suboptimal performance.
...and 20 more figures
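
The relationship between the Observed-GT $b_{t,(i,j)}$ and the Statistic-GT $p_{t,(i,j)}$ described in the captions of Figures 1 and 2 can be sketched as a Monte Carlo estimate: repeat the simulation from the same initial condition and take the per-cell normalized frequency of the target state. The spread rule below is a hypothetical toy (burning cells ignite each 4-neighbour with probability `p_spread` and never burn out); it is not the paper's forest-fire simulator, and `simulate_fire`, `statistic_gt`, and `p_spread` are illustrative names.

```python
import random

def simulate_fire(initial, steps, p_spread, rng):
    """Toy rule (assumption): each burning cell ignites each 4-neighbour with prob p_spread."""
    h, w = len(initial), len(initial[0])
    grid = [row[:] for row in initial]
    for _ in range(steps):
        nxt = [row[:] for row in grid]
        for i in range(h):
            for j in range(w):
                if grid[i][j] != 1:
                    continue
                for di, dj in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    ni, nj = i + di, j + dj
                    if 0 <= ni < h and 0 <= nj < w and grid[ni][nj] == 0:
                        if rng.random() < p_spread:
                            nxt[ni][nj] = 1
        grid = nxt
    return grid

def statistic_gt(initial, steps, p_spread, n_runs, seed=0):
    """Per-cell normalized frequency of the burning state over n_runs MC simulations."""
    rng = random.Random(seed)
    h, w = len(initial), len(initial[0])
    counts = [[0] * w for _ in range(h)]
    for _ in range(n_runs):
        final = simulate_fire(initial, steps, p_spread, rng)
        for i in range(h):
            for j in range(w):
                counts[i][j] += final[i][j]
    return [[c / n_runs for c in row] for row in counts]

# Same initial condition for every run: a single burning cell at the grid centre.
initial = [[0] * 7 for _ in range(7)]
initial[3][3] = 1

observed = simulate_fire(initial, 3, 0.3, random.Random(1))  # one realization, values in {0, 1}
stat = statistic_gt(initial, 3, 0.3, n_runs=500)             # frequencies in [0, 1]
```

Here `observed` is a single binary outcome, while `stat` summarizes the whole set of outcomes reachable from that initial condition, which is the distinction the evaluation framework builds on.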
