Variable Selection for Comparing High-dimensional Time-Series Data

Kensuke Mitsuzawa, Margherita Grossi, Stefano Bortoli, Motonobu Kanagawa

TL;DR

We address the problem of identifying when and where two high-dimensional time series differ by jointly selecting time subintervals $[t_{b-1}+1,t_b]$ and the active variables $d$ within each subinterval. The proposed Time-Slicing Variable Selection framework is a meta-algorithm that splits the total interval into $B$ subintervals, performs subinterval-wise two-sample variable selection on training portions to yield selected sets $\hat{S}_b$, and runs a permutation test on held-out data to produce p-values $p_b$, enabling interpretable localization of differences from a single pair of series. The framework is agnostic to the choice of two-sample variable-selection method. The paper demonstrates both MMD-based (e.g., an ARD kernel with $L_1$ regularization) and marginal-distribution approaches on synthetic and real data, including validation of a DNN emulator for a particle-based fluid simulator and a comparison of microscopic traffic simulations. The results illustrate the practical trade-offs in choosing the number of subintervals and show that the approach can provide actionable, region-specific diagnostics for simulator validation and model comparison without requiring multiple realizations.

Abstract

Given a pair of multivariate time-series data of the same length and dimensions, an approach is proposed to select variables and time intervals where the two series are significantly different. In applications where one time series is an output from a computationally expensive simulator, the approach may be used for validating the simulator against real data, for comparing the outputs of two simulators, and for validating a machine learning-based emulator against the simulator. With the proposed approach, the entire time interval is split into multiple subintervals, and on each subinterval, the two sample sets are compared to select variables that distinguish their distributions and a two-sample test is performed. The validity and limitations of the proposed approach are investigated in synthetic data experiments. Its usefulness is demonstrated in an application with a particle-based fluid simulator, where a deep neural network model is compared against the simulator, and in an application with a microscopic traffic simulator, where the effects of changing the simulator's parameters on traffic flows are analysed.
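
The pipeline described above (split the time axis, select variables on a training portion, test on held-out data) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `select_variables` is a hypothetical marginal selector based on a standardized mean difference (standing in for the MMD-based or marginal-distribution selectors used in the paper), the test statistic is a simple mean-difference norm, subintervals are equal-sized, and time points within a subinterval are treated as exchangeable. All function names, the threshold, and the toy data are assumptions made for illustration.

```python
import numpy as np

def select_variables(X_tr, Y_tr, threshold=0.5):
    """Hypothetical stand-in selector: keep variables whose standardized
    mean difference between the two samples exceeds `threshold`."""
    diff = np.abs(X_tr.mean(axis=1) - Y_tr.mean(axis=1))
    pooled = np.sqrt(0.5 * (X_tr.var(axis=1) + Y_tr.var(axis=1))) + 1e-12
    return np.flatnonzero(diff / pooled > threshold)

def permutation_pvalue(X_te, Y_te, selected, n_perm=500, rng=None):
    """Permutation two-sample test restricted to the selected variables."""
    rng = np.random.default_rng(rng)
    if selected.size == 0:
        return 1.0  # nothing selected, so no evidence of a difference
    Z = np.concatenate([X_te[selected], Y_te[selected]], axis=1)
    n_x = X_te.shape[1]

    def stat(pooled_sample):
        # Norm of the difference between the two group means (an assumed statistic).
        return np.linalg.norm(
            pooled_sample[:, :n_x].mean(axis=1) - pooled_sample[:, n_x:].mean(axis=1)
        )

    observed = stat(Z)
    count = 0
    for _ in range(n_perm):
        perm = rng.permutation(Z.shape[1])  # reshuffle the group labels
        if stat(Z[:, perm]) >= observed:
            count += 1
    return (count + 1) / (n_perm + 1)

def time_slicing_selection(X, Y, B, train_frac=0.5, seed=0):
    """For each of B subintervals, return (start, end, selected variables, p-value).
    Start and end are 1-based time points; variables are 0-based indices."""
    rng = np.random.default_rng(seed)
    T = X.shape[1]
    results = []
    for idx in np.array_split(np.arange(T), B):
        shuffled = rng.permutation(idx)            # randomise time points in the block
        n_tr = int(train_frac * shuffled.size)
        tr, te = shuffled[:n_tr], shuffled[n_tr:]  # disjoint train / held-out parts
        S_b = select_variables(X[:, tr], Y[:, tr])
        p_b = permutation_pvalue(X[:, te], Y[:, te], S_b, rng=rng)
        results.append((int(idx[0]) + 1, int(idx[-1]) + 1, S_b.tolist(), p_b))
    return results

# Toy data (an assumption): the series differ only in variable d = 4
# (0-based index 3) on the second half of the time axis.
data_rng = np.random.default_rng(0)
X = data_rng.normal(size=(5, 500))
Y = data_rng.normal(size=(5, 500))
Y[3, 250:] += 2.0

results = time_slicing_selection(X, Y, B=2)
for start, end, S_b, p_b in results:
    print(f"t = {start}..{end}: selected = {S_b}, p-value = {p_b:.3f}")
```

Because selection and testing use disjoint time points, the p-value on the held-out part is not biased by the selection step. Any other two-sample variable-selection method can be swapped in for `select_variables` without changing the rest of the loop.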

Paper Structure

This paper contains 26 sections, 15 equations, 21 figures, 2 algorithms.

Figures (21)

  • Figure 1: Illustration of the proposed approach to variable selection for two multivariate time series. Top row: $X = ({\bm x}_1, \dots, {\bm x}_T) \in \mathbb{R}^{D \times T}$ and $Y = ({\bm y}_1, \dots, {\bm y}_T) \in \mathbb{R}^{D \times T}$ represent two given multivariate time series. Time-splitting is applied to each of $X$ and $Y$ at time points $t_1, t_2, \dots, t_{B-1}$. Middle row: On each subinterval $b = 1, 2, \dots, B$, two-sample variable selection is applied to compare the data matrices $X_b = ({\bm x}_{t_{b-1}+1}, \dots, {\bm x}_{t_b})$ and $Y_b = ({\bm y}_{t_{b-1}+1}, \dots, {\bm y}_{t_b})$. Bottom row: Selected variables $\hat{S}_b \subset \{1, \dots, D\}$ from the middle row are used to perform a permutation two-sample test to calculate a p-value $p_b$.
  • Figure 2: An example of time-splitting applied to two time series $X = ({\bm x}_1, \dots, {\bm x}_T) \in \mathbb{R}^{D \times T}$ and $Y = ({\bm y}_1, \dots, {\bm y}_T) \in \mathbb{R}^{D \times T}$ with $D = 5$ and $T = 130$, with $t_1 = 50$, $t_2 = 100$ and $t_3 = 130 = T$. The top plots show the two time series $X$ and $Y$, where the 5 different colours correspond to the 5 variables (or dimensions). In each plot, the horizontal axis represents time points and the vertical axis represents the values of each variable. The two variables shown in violet and light blue follow different stochastic processes for $X$ and $Y$. The bottom plots show the results of applying the time-splitting and randomisation in each subinterval.
  • Figure 3: Illustration of two time series $X \in \mathbb{R}^{D \times T}$ and $Y \in \mathbb{R}^{D \times T}$ in Eq. \ref{eq:data-setting-1-56}. Each subfigure shows the trajectories of $X$ and $Y$ in each variable $d = 1,\dots,D$, i.e., $x_{1,d}, \dots, x_{T,d}$ and $y_{1,d}, \dots, y_{T,d}$. The variable $d=4$ from $t=251$ to $t = 500$ is where $X$ and $Y$ differ.
  • Figure 4: Illustration of two time series $X \in \mathbb{R}^{D \times T}$ and $Y \in \mathbb{R}^{D \times T}$ in Eq. \ref{eq:data-setting-2-316}. Each subfigure shows the trajectories of $X$ and $Y$ in each variable $d = 1,\dots,D$, i.e., $x_{1,d}, \dots, x_{T,d}$ and $y_{1,d}, \dots, y_{T,d}$. The variable $d=4$ from $t=251$ to $t = 500$ is where $X$ and $Y$ differ.
  • Figure 5: Setting 1 with $B =10$.
  • ...and 16 more figures
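
The time-splitting in the Figure 2 caption ($D = 5$, $T = 130$, split points $t_1 = 50$, $t_2 = 100$, $t_3 = 130$) can be sketched as below. This is a minimal sketch under assumptions: "randomisation" is read here as shuffling the time points within each subinterval before dividing them into training and held-out parts, the same shuffled indices are applied to both series, and the function name `split_and_randomise`, the 50/50 split, and the placeholder data are all hypothetical.

```python
import numpy as np

def split_and_randomise(X, Y, split_points, train_frac=0.5, seed=0):
    """Slice X and Y (each D x T) at the given 1-based split points
    t_1 < ... < t_B = T, shuffle the time points inside each subinterval,
    and divide them into training and held-out parts."""
    rng = np.random.default_rng(seed)
    blocks = []
    start = 0  # t_0 = 0, so the first subinterval covers times 1..t_1
    for t_b in split_points:
        # 0-based column indices for the 1-based time points start+1 .. t_b
        idx = rng.permutation(np.arange(start, t_b))
        n_tr = int(train_frac * idx.size)
        tr, te = idx[:n_tr], idx[n_tr:]
        blocks.append({
            "interval": (start + 1, t_b),
            "X_train": X[:, tr], "Y_train": Y[:, tr],
            "X_test": X[:, te], "Y_test": Y[:, te],
        })
        start = t_b
    return blocks

# Placeholder data with the dimensions from the Figure 2 caption.
rng = np.random.default_rng(1)
X = rng.normal(size=(5, 130))
Y = rng.normal(size=(5, 130))

blocks = split_and_randomise(X, Y, split_points=[50, 100, 130])
for block in blocks:
    print(block["interval"], block["X_train"].shape, block["X_test"].shape)
```

The subintervals need not have equal length: here the last one covers only 30 time points, so it has fewer samples available for both selection and testing than the first two.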