Recurrent Neural Goodness-of-Fit Test for Time Series
Aoran Zhang, Wenbin Zhou, Liyan Xie, Shixiang Zhu
TL;DR
RENAL is a statistically principled goodness-of-fit (GOF) test for generative time series models. It uses recurrent networks to turn dependent sequences into conditionally independent data pairs built from history embeddings. The core idea is to summarize the history through a low-dimensional Markov transition density $Q$ and, after adaptive binning of the embedding space, compare the real and generated processes with a chi-square test on the discretized transitions. The authors establish that the test statistic $W_m$ is asymptotically $\chi^2_{m(m-1)}$-distributed under $H_0$ and provide a practical algorithm that jointly optimizes the binning and the transition discrepancy to maximize testing power. Empirically, RENAL achieves better Type-I error control and lower Type-II error than a broad set of baselines on synthetic data (time series, temporal point processes (TPP), and spatio-temporal point processes (STPP)) and on real-world earthquake and weather datasets, offering a robust, scalable tool for evaluating time series generative models.
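To make the chi-square step concrete, here is a minimal Python sketch of a classical chi-square test on the transitions of a discretized Markov chain. It is not the authors' implementation: it assumes that $m$ is the number of bins, that the sequence has already been mapped to bin indices, and that the reference transition matrix is known, whereas RENAL learns the embedding and the binning and compares real against generated data. The names `transition_chi2_test` and `simulate_chain` are hypothetical helpers introduced only for this illustration. What the sketch does share with the result above is the reference distribution: each of the $m$ rows of the transition matrix contributes $m-1$ free parameters, giving $m(m-1)$ degrees of freedom.

```python
import numpy as np
from scipy.stats import chi2


def transition_chi2_test(states, p0):
    """Chi-square test that a discretized sequence follows transition matrix p0.

    states: 1-D integer array of bin indices in {0, ..., m-1}.
    p0: (m, m) row-stochastic reference matrix with positive entries (assumed known).
    Returns (statistic, degrees of freedom, p-value).
    """
    states = np.asarray(states)
    p0 = np.asarray(p0, dtype=float)
    m = p0.shape[0]

    # Count observed transitions i -> j.
    counts = np.zeros((m, m))
    np.add.at(counts, (states[:-1], states[1:]), 1)

    # Expected counts under H0: (number of visits to i) * p0[i, j].
    expected = counts.sum(axis=1, keepdims=True) * p0

    # Skip cells with zero expected count (states that were never visited).
    mask = expected > 0
    stat = float((((counts - expected) ** 2)[mask] / expected[mask]).sum())

    dof = m * (m - 1)
    return stat, dof, float(chi2.sf(stat, dof))


def simulate_chain(p, n, rng):
    """Draw a length-n path from a Markov chain with transition matrix p."""
    m = p.shape[0]
    states = np.empty(n, dtype=int)
    states[0] = rng.integers(m)
    for t in range(1, n):
        states[t] = rng.choice(m, p=p[states[t - 1]])
    return states


rng = np.random.default_rng(0)
p0 = np.array([[0.8, 0.1, 0.1],
               [0.1, 0.8, 0.1],
               [0.1, 0.1, 0.8]])
p_alt_matrix = np.full((3, 3), 1.0 / 3.0)

# Sequence generated from p0 itself (H0 holds) versus from a different chain.
stat_null, dof_null, p_null = transition_chi2_test(simulate_chain(p0, 5000, rng), p0)
stat_alt, dof_alt, p_alt = transition_chi2_test(simulate_chain(p_alt_matrix, 5000, rng), p0)

print(f"H0 true:  W = {stat_null:.2f}, dof = {dof_null}, p = {p_null:.3f}")
print(f"H0 false: W = {stat_alt:.2f}, dof = {dof_alt}, p = {p_alt:.3g}")
```

When the sequence is drawn from `p0`, the statistic should be on the order of its degrees of freedom; when it is drawn from the uniform chain, the statistic is far larger and the test rejects. A two-sample comparison of real and generated sequences, as described above, would replace the known `p0` with quantities estimated from data.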
Abstract
Time series data are crucial across diverse domains such as finance and healthcare, where accurate forecasting and decision-making rely on advanced modeling techniques. While generative models have shown great promise in capturing the intricate dynamics inherent in time series, evaluating their performance remains a major challenge. Traditional evaluation metrics fall short due to the temporal dependencies and potential high dimensionality of the features. In this paper, we propose the REcurrent NeurAL (RENAL) Goodness-of-Fit test, a novel and statistically rigorous framework for evaluating generative time series models. By leveraging recurrent neural networks, we transform the time series into conditionally independent data pairs, enabling the application of a chi-square-based goodness-of-fit test to the temporal dependencies within the data. This approach offers a robust, theoretically grounded solution for assessing the quality of generative models, particularly in settings with limited time sequences. We demonstrate the efficacy of our method across both synthetic and real-world datasets, outperforming existing methods in terms of reliability and accuracy. Our method fills a critical gap in the evaluation of time series generative models, offering a tool that is both practical and adaptable to high-stakes applications.
