CausalRivers -- Scaling up benchmarking of causal discovery for real-world time-series

Gideon Stein, Maha Shadaydeh, Jan Blunk, Niklas Penzel, Joachim Denzler

TL;DR

CausalRivers delivers the largest real-world benchmarking kit for causal discovery on time-series, using extensive river-discharge data with ground-truth graphs to stress-test methods under conditions like high dimensionality, non-stationarity, and distributional shifts. The authors provide a ready-to-use pipeline, multiple baselines, and a suite of experiments that reveal robustness gaps in state-of-the-art methods and the value of domain adaptation. Key findings show that simple, well-tuned baselines often rival complex models, while domain adaptation via fine-tuning can yield meaningful gains across diverse subgraphs and data regimes. This benchmark has practical impact by enabling robust, benchmark-driven method development in hydrology and related time-series domains, with potential extensions to forecasting and anomaly detection.

Abstract

Causal discovery, or identifying causal relationships from observational data, is a notoriously challenging task, with numerous methods proposed to tackle it. Despite this, in-the-wild evaluation of these methods is still lacking, as works frequently rely on synthetic data evaluation and sparse real-world examples under critical theoretical assumptions. Real-world causal structures, however, are often complex, making it hard to decide on a proper causal discovery strategy. To bridge this gap, we introduce CausalRivers, the largest in-the-wild causal discovery benchmarking kit for time-series data to date. CausalRivers features an extensive dataset on river discharge that covers the eastern German territory (666 measurement stations) and the state of Bavaria (494 measurement stations). It spans the years 2019 to 2023 with a 15-minute temporal resolution. Further, we provide additional data from a flood around the Elbe River, as an event with a pronounced distributional shift. Leveraging multiple sources of information and time-series meta-data, we constructed two distinct causal ground truth graphs (Bavaria and eastern Germany). These graphs can be sampled to generate thousands of subgraphs to benchmark causal discovery across diverse and challenging settings. To demonstrate the utility of CausalRivers, we evaluate several causal discovery approaches through a set of experiments to identify areas for improvement. CausalRivers has the potential to facilitate robust evaluations and comparisons of causal discovery methods. Besides this primary purpose, we also expect that this dataset will be relevant for connected areas of research, such as time-series forecasting and anomaly detection. Based on this, we hope to push benchmark-driven method development that fosters advanced techniques for causal discovery, as is the case for many other areas of machine learning.
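The abstract notes that the ground truth graphs can be sampled to generate subgraphs for benchmarking. As a minimal sketch of what such sampling could look like, the snippet below grows a weakly connected node set by random expansion and returns the induced directed edges. The function name, the expansion strategy, and the toy edge list are illustrative assumptions, not the sampling procedure or API shipped with CausalRivers.

```python
import random
from collections import defaultdict


def sample_connected_subgraph(edges, size, seed=0):
    """Return (nodes, induced_edges) for a weakly connected subgraph.

    Hypothetical helper for illustration; not part of the CausalRivers kit.
    `edges` is a list of directed (upstream, downstream) pairs.
    """
    rng = random.Random(seed)

    # Undirected adjacency, used only to keep the sample weakly connected.
    neighbors = defaultdict(set)
    for u, v in edges:
        neighbors[u].add(v)
        neighbors[v].add(u)

    start = rng.choice(sorted(neighbors))
    chosen = {start}
    frontier = set(neighbors[start])

    # Repeatedly add a random node adjacent to the current sample.
    while len(chosen) < size and frontier:
        nxt = rng.choice(sorted(frontier))
        chosen.add(nxt)
        frontier |= neighbors[nxt]
        frontier -= chosen

    if len(chosen) < size:
        raise ValueError("connected component is smaller than requested size")

    # Keep the original edge directions among the chosen nodes.
    induced = sorted((u, v) for u, v in edges if u in chosen and v in chosen)
    return sorted(chosen), induced


# Toy river network (made-up station names): tributaries flow downstream.
toy_edges = [("A", "C"), ("B", "C"), ("C", "D"), ("D", "F"), ("E", "F")]
nodes, sub_edges = sample_connected_subgraph(toy_edges, size=3, seed=1)
```

Varying `seed` and `size` yields different subgraphs from the same ground truth graph, which is the general idea behind producing many benchmark instances from a single large graph.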

Paper Structure

This paper contains 16 sections, 8 figures, 8 tables.

Figures (8)

  • Figure 1: The causal ground truth graphs for river discharge measurement stations are provided with this benchmarking kit. Jointly, these two graphs hold over 1000 nodes. Different colors represent different data origins that we specify in \ref{app:1}.
  • Figure 2: Left/Top: A single sampled causal relationship along with time-series data from RiversEastGermany. A massive precipitation event is marked in red. Right/Bottom: The pronounced distributional shift between the same nodes of the Elbe in RiversEastGermany and RiversElbeFlood.
  • Figure 3: AUROC scores for Experiment Set 2 (a) and Experiment Set 3 (b). We mark increases in performance with $\uparrow$. Further, the highest performance per method is marked in bold.
  • Figure 4: Left/Center: Annual discharge patterns (Mean over 5 years) of the biggest rivers (Elbe or Danube) in the three datasets. Notably, the Elbe shows a more pronounced annual cycle than the Danube, emphasizing distributional differences between the two datasets. Right: Discharge pattern of the Elbe river in the RiversElbeFlood dataset. A strong and sudden increase in discharge can be observed.
  • Figure 5: Distribution over the average discharge in the three CausalRivers time-series datasets. Notably, a few big rivers (e.g., Elbe, Danube, Oder) show vastly higher average discharges.
  • ...and 3 more figures