CausalCompass: Evaluating the Robustness of Time-Series Causal Discovery in Misspecified Scenarios

Huiyang Yi, Xiaojian Shen, Yonggang Wu, Duxin Chen, He Wang, Wenwu Yu

TL;DR

This paper tackles the robustness gap in time-series causal discovery (TSCD) by introducing CausalCompass, a flexible benchmark suite that evaluates TSCD methods under eight misspecification scenarios. It benchmarks 11 methods in linear and nonlinear settings and finds that deep learning-based approaches are generally the most robust, while preprocessing choices such as standardization can significantly alter performance (notably benefiting NTS-NOTEARS). The study provides extensive hyperparameter analyses and shows that no method is universally best, underscoring the need for robustness-focused evaluation in real-world deployments. By releasing code and datasets, the work aims to promote broader adoption of TSCD in practice and to encourage further research into robust deep learning-based TSCD methods and preprocessing-aware evaluation.

Abstract

Causal discovery from time series is a fundamental task in machine learning. However, its widespread adoption is hindered by a reliance on untestable causal assumptions and by the lack of robustness-oriented evaluation in existing benchmarks. To address these challenges, we propose CausalCompass, a flexible and extensible benchmark suite designed to assess the robustness of time-series causal discovery (TSCD) methods under violations of modeling assumptions. To demonstrate the practical utility of CausalCompass, we conduct extensive benchmarking of representative TSCD algorithms across eight assumption-violation scenarios. Our experimental results indicate that no single method consistently attains optimal performance across all settings. Nevertheless, the methods exhibiting superior overall performance across diverse scenarios are almost invariably deep learning-based approaches. We further provide hyperparameter sensitivity analyses to deepen the understanding of these findings. We also find, somewhat surprisingly, that NTS-NOTEARS relies heavily on standardized preprocessing in practice, performing poorly in the vanilla setting but exhibiting strong performance after standardization. Finally, our work aims to provide a comprehensive and systematic evaluation of TSCD methods under assumption violations, thereby facilitating their broader adoption in real-world applications. The code and datasets are available at https://github.com/huiyang-yi/CausalCompass.
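
The standardization finding can be made concrete with a small sketch. The abstract does not spell out the exact preprocessing, so this assumes the common per-variable z-score (subtract each variable's mean over time, divide by its standard deviation). The function name `standardize` and the list-of-rows layout are illustrative choices, not taken from the CausalCompass code.

```python
from statistics import fmean, pstdev


def standardize(series):
    """Z-score each variable of a multivariate time series.

    `series` is a list of T rows, each row holding d variable values.
    Assumption: per-variable z-scoring with the population standard
    deviation; the benchmark's own preprocessing may differ.
    """
    columns = list(zip(*series))          # one tuple per variable
    means = [fmean(col) for col in columns]
    stds = [pstdev(col) for col in columns]
    # Leave constant variables at zero instead of dividing by zero.
    stds = [s if s > 0 else 1.0 for s in stds]
    return [
        [(x - m) / s for x, m, s in zip(row, means, stds)]
        for row in series
    ]


raw = [[1.0, 10.0, 5.0], [2.0, 20.0, 5.0], [3.0, 30.0, 5.0]]
scaled = standardize(raw)
```

After this step every non-constant variable has zero mean and unit variance, so differences in scale between variables no longer carry information that a method could pick up on.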

Paper Structure (38 sections, 12 equations, 18 figures, 88 tables)

Figures (18)

  • Figure 1: Experimental results under the linear and nonlinear settings across the vanilla scenario and eight assumption violation scenarios. AUROC and AUPRC (the higher the better) are evaluated over 5 trials for the 10-node case with $T = 1000$. For the deep learning-based methods, we present only the optimal results.
  • Figure 2: Experimental results under the linear and nonlinear settings across the vanilla scenario and eight assumption violation scenarios. AUROC and AUPRC (the higher the better) are evaluated over 5 trials for the 10-node case with $T = 500$. For the deep learning-based methods, we present only the optimal results.
  • Figure 3: Experimental results under the nonlinear settings across the vanilla scenario and eight assumption violation scenarios. AUROC and AUPRC (the higher the better) are evaluated over 5 trials for the 10-node case with $F = 40$. For the deep learning-based methods, we present only the optimal results.
  • Figure 4: Experimental results under the linear and nonlinear settings across the vanilla scenario and eight assumption violation scenarios. AUROC and AUPRC (the higher the better) are evaluated over 5 trials for the 15-node case with $T = 500$. For the deep learning-based methods, we present only the optimal results.
  • Figure 5: Experimental results under the linear and nonlinear settings across the vanilla scenario and eight assumption violation scenarios. AUROC and AUPRC (the higher the better) are evaluated over 5 trials for the 15-node case with $T = 1000$. For the deep learning-based methods, we present only the optimal results.
  • ...and 13 more figures