Recurrent Neural Goodness-of-Fit Test for Time Series

Aoran Zhang, Wenbin Zhou, Liyan Xie, Shixiang Zhu

TL;DR

RENAL introduces a statistically principled goodness-of-fit (GOF) test for generative time series models by transforming dependent sequences into conditionally independent history embeddings via recurrent networks. The core idea is to model the history through a low-dimensional Markov transition density $Q$ and, after adaptive binning of the embedding space, compare the real and generated processes with a chi-square test on the discretized transitions. The authors establish that the test statistic is asymptotically distributed as $W_m \sim \chi^2_{m(m-1)}$ under $H_0$ and provide a practical algorithm that jointly optimizes the binning and the transition discrepancy to maximize testing power. Empirically, RENAL shows lower Type-I and Type-II error than a broad set of baselines on synthetic data (time series, temporal point processes (TPP), and spatio-temporal point processes (STPP)) and on real-world datasets (earthquakes and weather), offering a robust, scalable tool for evaluating time series generative models.
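To make the chi-square comparison on discretized transitions concrete, here is a minimal sketch of the classical Pearson-type statistic for Markov chain transition counts. It is not the authors' implementation: the function name `chi_square_transition_statistic`, the inputs `counts` and `ref`, and the assumption that the reference transition matrix is known and strictly positive are all illustrative choices made here.

```python
def chi_square_transition_statistic(counts, ref):
    """Pearson-type statistic for observed transition counts vs. a reference.

    counts[i][j]: number of observed transitions from bin i to bin j.
    ref[i][j]:    hypothesized probability of moving from bin i to bin j
                  (each row sums to 1; assumed strictly positive here).
    Returns the statistic and the degrees of freedom m * (m - 1).
    """
    m = len(counts)
    stat = 0.0
    for i in range(m):
        n_i = sum(counts[i])  # number of transitions leaving bin i
        for j in range(m):
            expected = n_i * ref[i][j]
            # Cells with zero expected count are skipped to avoid division
            # by zero; in that case m * (m - 1) is no longer the right
            # degrees of freedom and would need adjusting.
            if expected > 0:
                stat += (counts[i][j] - expected) ** 2 / expected
    return stat, m * (m - 1)


# Hypothetical example: a 2-bin chain that always alternates, tested
# against a reference in which every transition has probability 0.5.
counts = [[0, 4], [4, 0]]
ref = [[0.5, 0.5], [0.5, 0.5]]
stat, dof = chi_square_transition_statistic(counts, ref)
print(stat, dof)  # 8.0 2
```

In this toy case the statistic of 8.0 exceeds 5.991, the 5% critical value of a chi-square distribution with 2 degrees of freedom, so the reference would be rejected.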

Abstract

Time series data are crucial across diverse domains such as finance and healthcare, where accurate forecasting and decision-making rely on advanced modeling techniques. While generative models have shown great promise in capturing the intricate dynamics inherent in time series, evaluating their performance remains a major challenge. Traditional evaluation metrics fall short due to the temporal dependencies and potential high dimensionality of the features. In this paper, we propose the REcurrent NeurAL (RENAL) Goodness-of-Fit test, a novel and statistically rigorous framework for evaluating generative time series models. By leveraging recurrent neural networks, we transform the time series into conditionally independent data pairs, enabling the application of a chi-square-based goodness-of-fit test to the temporal dependencies within the data. This approach offers a robust, theoretically grounded solution for assessing the quality of generative models, particularly in settings with limited time sequences. We demonstrate the efficacy of our method across both synthetic and real-world datasets, outperforming existing methods in terms of reliability and accuracy. Our method fills a critical gap in the evaluation of time series generative models, offering a tool that is both practical and adaptable to high-stakes applications.

Paper Structure

This paper contains 39 sections, 6 theorems, 41 equations, 4 figures, 4 tables, 1 algorithm.

Key Result

Lemma 1

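The history embedding sequence $\{h_i\}_{i=1}^n$ is a homogeneous Markov chain, i.e., for any set $B \subset \mathcal{H}$: $\mathbb{P}(h_{i+1} \in B \mid h_i, \dots, h_1) = \mathbb{P}(h_{i+1} \in B \mid h_i)$ and $\mathbb{P}(h_{i+1} \in B \mid h_i = h) = \mathbb{P}(h_2 \in B \mid h_1 = h)$ for all $i$ and all $h \in \mathcal{H}$. The proof is provided in the paper's appendix.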
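As an illustration of the object in Lemma 1, the sketch below builds a history embedding sequence with a vanilla tanh recurrent cell. The cell form, the weights `W`, `U`, `b`, and the zero initial state are assumptions made here for illustration; the source does not specify the architecture of the recurrent network. The point is only that each $h_i$ is computed from $h_{i-1}$ and the newest observation, so the embedding summarizes the whole history in a fixed-size state.

```python
import math


def rnn_step(h_prev, x, W, U, b):
    """One vanilla RNN update: h = tanh(W @ h_prev + U @ x + b)."""
    d = len(h_prev)
    return [
        math.tanh(
            sum(W[r][c] * h_prev[c] for c in range(d))
            + sum(U[r][c] * x[c] for c in range(len(x)))
            + b[r]
        )
        for r in range(d)
    ]


def history_embeddings(xs, W, U, b):
    """Run the recurrence over observations xs and return [h_1, ..., h_n]."""
    h = [0.0] * len(b)  # assumed zero initial state
    out = []
    for x in xs:
        h = rnn_step(h, x, W, U, b)
        out.append(h)
    return out


# Hypothetical 1-dimensional example with hand-picked weights.
embs = history_embeddings([[0.0], [1.0]], W=[[0.0]], U=[[1.0]], b=[0.0])
print(len(embs))  # 2
```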

Figures (4)

  • Figure 1: An illustration of our problem setup. In traditional test problems, both the real data ($D_0$) and the model-generated data ($D_1$) are assumed to be i.i.d., following the underlying distributions $\mathbb P^\star$ and $\widehat{\mathbb P}$, respectively. However, in our test problem, both $D_0$ and $D_1$ exhibit general temporal dependencies.
  • Figure 2: Architecture of the proposed framework. Real-world observations are compared to model-generated sequences, with darker blue indicating better fits. We first use a recurrent neural network $\phi$ to extract conditionally independent history embeddings. Then we construct their transition probability matrices using these embeddings and evaluate the fit with a chi-square discrepancy test.
  • Figure 3: Transition probability matrices of history embeddings $Q$ from (a) real data, (b) data generated by Model $1$, and (c) data generated by Model $2$. Model $1$ exhibits a better fit compared to Model $2$, as evidenced by the closer resemblance between the histograms in (a) and (b). The number in the parentheses indicates the corresponding testing score.
  • Figure 4: Ablation study of hyper-parameter $\lambda$ selection.
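The transition probability matrices shown in Figure 3 can be approximated from a sequence of embeddings by binning and counting. The sketch below assumes scalar embeddings and uses simple equal-frequency binning as a stand-in for the adaptive binning described in the paper; the function names `quantile_bins` and `empirical_transition_matrix` are hypothetical.

```python
def quantile_bins(values, m):
    """Assign each scalar value to one of m equal-frequency bins (0..m-1)."""
    n = len(values)
    order = sorted(range(n), key=lambda k: values[k])
    bins = [0] * n
    for rank, k in enumerate(order):
        bins[k] = rank * m // n  # rank-based bin index
    return bins


def empirical_transition_matrix(bins, m):
    """Count consecutive bin transitions and row-normalize the counts."""
    counts = [[0] * m for _ in range(m)]
    for a, b in zip(bins[:-1], bins[1:]):
        counts[a][b] += 1
    matrix = []
    for row in counts:
        total = sum(row)
        # Rows with no outgoing transitions are left as all zeros.
        matrix.append([c / total if total else 0.0 for c in row])
    return counts, matrix


# Hypothetical scalar embeddings that alternate between low and high values.
values = [0.1, 0.9, 0.2, 0.8, 0.3, 0.7]
bins = quantile_bins(values, 2)
counts, Q_hat = empirical_transition_matrix(bins, 2)
print(bins)   # [0, 1, 0, 1, 0, 1]
print(Q_hat)  # [[0.0, 1.0], [1.0, 0.0]]
```

Equal-frequency bins keep every row of the count matrix reasonably populated, which matters for a chi-square approximation, but they do not optimize testing power the way a learned binning would.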

Theorems & Definitions (8)

  • Lemma 1
  • Proposition 1
  • Lemma 2
  • Proposition 2
  • Lemma 3: Lemma 3.2 in Billingsley (1961)
  • proof
  • Lemma 4: Theorem 3.1 in Billingsley (1961)
  • proof