DiTEC-WDN: A Large-Scale Dataset of Hydraulic Scenarios across Multiple Water Distribution Networks
Huy Truong, Andrés Tello, Alexander Lazovik, Victoria Degeler
TL;DR
DiTEC-WDN addresses privacy restrictions on real water distribution networks by introducing a large-scale synthetic dataset of 36 networks with 228 million hourly state snapshots across short-term (24 h) and long-term (1 year) horizons. An automated pipeline combines hspo-based parameter sampling with PSO to generate hydraulically plausible states, including per-node demand patterns and daily and yearly cycles with noise, all validated by rule-based checks. Comparative validations against Baseline networks and LeakDB show broader, less redundant coverage of the demand-pressure space and greater inter-scenario diversity, which enables robust graph-level, node-level, and time-series learning tasks while preserving privacy. As a scalable, open benchmark, the dataset supports surrogate modeling, state estimation, and demand forecasting; its structured metadata, HPC-enabled simulation workflow, and permissive licensing facilitate cross-study benchmarking.
Abstract
Privacy restrictions hinder the sharing of real-world Water Distribution Network (WDN) models, limiting the application of emerging data-driven machine learning, which typically requires extensive observations. To address this challenge, we propose DiTEC-WDN, a dataset comprising 36,000 unique scenarios, each simulated over either a short-term (24 hours) or a long-term (1 year) period. We constructed this dataset using an automated pipeline that optimizes crucial parameters (e.g., pressure, flow rate, and demand patterns), facilitates large-scale simulations, and records discrete, synthetic but hydraulically realistic states under standard conditions, verified through rule validation and post-hoc analysis. With a total of 228 million generated graph-based states, DiTEC-WDN can support a variety of machine-learning tasks, including graph-level, node-level, and link-level regression, as well as time-series forecasting. This contribution, released under a public license, encourages open scientific research in the critical water sector, eliminates the risk of exposing sensitive data, and fulfills the need for a large-scale water distribution network benchmark for study comparisons and scenario analysis.
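To make the idea of noisy daily and yearly demand cycles with rule-based validation more concrete, the following is a minimal illustrative sketch, not the DiTEC-WDN pipeline itself. The function names, cycle amplitudes (0.3 and 0.1), noise level, and upper bound of 3.0 are all assumptions chosen for illustration; the actual pipeline relies on hydraulic simulation and its own validation rules.

```python
import math
import random

HOURS_PER_DAY = 24
DAYS_PER_YEAR = 365


def synthetic_demand_multipliers(n_hours, noise_std=0.05, seed=0):
    """Hypothetical hourly demand multipliers.

    Each value is a daily sinusoid times a yearly sinusoid plus Gaussian
    noise. Amplitudes and noise level are assumed, not taken from the paper.
    """
    rng = random.Random(seed)  # seeded for reproducible scenarios
    hours_per_year = HOURS_PER_DAY * DAYS_PER_YEAR
    multipliers = []
    for t in range(n_hours):
        daily = 1.0 + 0.3 * math.sin(2 * math.pi * t / HOURS_PER_DAY)
        yearly = 1.0 + 0.1 * math.sin(2 * math.pi * t / hours_per_year)
        noisy = daily * yearly + rng.gauss(0.0, noise_std)
        multipliers.append(max(noisy, 0.0))  # demand cannot be negative
    return multipliers


def passes_rule_checks(multipliers, upper=3.0):
    """Illustrative rule check: values are finite and within [0, upper]."""
    return all(math.isfinite(m) and 0.0 <= m <= upper for m in multipliers)


# Short-term (24 hours) and long-term (1 year) horizons at hourly resolution.
short_term = synthetic_demand_multipliers(HOURS_PER_DAY)
long_term = synthetic_demand_multipliers(HOURS_PER_DAY * DAYS_PER_YEAR)
```

In this sketch a scenario would be kept only if `passes_rule_checks` returns `True`. A real pipeline would apply such checks to simulated pressures and flows as well, not only to the input demand pattern.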
