DiTEC-WDN: A Large-Scale Dataset of Hydraulic Scenarios across Multiple Water Distribution Networks

Huy Truong, Andrés Tello, Alexander Lazovik, Victoria Degeler

TL;DR

DiTEC-WDN addresses privacy restrictions on real water distribution networks by introducing a large-scale synthetic dataset of 36 networks with 228 million hourly state snapshots across short-term (24 h) and long-term (1 year) horizons. An automated pipeline combines hspo-based parameter sampling with PSO to generate hydraulically plausible states, including per-node demand patterns and daily/yearly cycles with noise, all validated by rule-based checks. Comparative validations against baseline networks and LeakDB demonstrate broader, less redundant coverage of the demand-pressure space and greater inter-scenario diversity, enabling robust graph-level, node-level, and time-series learning tasks while preserving privacy. The dataset supports surrogate modeling, state estimation, and demand forecasting in a scalable, open benchmark, facilitated by its structured metadata, HPC-enabled simulation workflow, and permissive licensing for cross-study benchmarking.
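
To make the idea of demand patterns with daily and yearly cycles plus noise concrete, here is a minimal Python sketch. It is not the authors' generator: the function name `synthetic_demand`, the sinusoidal form, and all amplitudes and noise levels are illustrative assumptions. Only the two horizons (24 hours and 1 year, sampled hourly) come from the text.

```python
import math
import random


def synthetic_demand(hours, base=1.0, daily_amp=0.3, yearly_amp=0.1,
                     noise_std=0.05, seed=0):
    """Return hourly demand multipliers (hypothetical model).

    Each value is base + daily sine + yearly sine + Gaussian noise,
    clipped at zero so that demand never becomes negative.
    """
    rng = random.Random(seed)  # seeded for reproducibility
    series = []
    for t in range(hours):
        daily = daily_amp * math.sin(2 * math.pi * t / 24)
        yearly = yearly_amp * math.sin(2 * math.pi * t / (24 * 365))
        noise = rng.gauss(0.0, noise_std)
        series.append(max(0.0, base + daily + yearly + noise))
    return series


# One short-term (24 h) and one long-term (1 year) pattern for a single node.
short_term = synthetic_demand(24)
long_term = synthetic_demand(24 * 365, seed=1)
```

Using a different seed per node is one simple way to obtain per-node patterns that share the same cycles but differ in their noise.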

Abstract

Privacy restrictions hinder the sharing of real-world Water Distribution Network (WDN) models, limiting the application of emerging data-driven machine learning, which typically requires extensive observations. To address this challenge, we propose the dataset DiTEC-WDN that comprises 36,000 unique scenarios simulated over either short-term (24 hours) or long-term (1 year) periods. We constructed this dataset using an automated pipeline that optimizes crucial parameters (e.g., pressure, flow rate, and demand patterns), facilitates large-scale simulations, and records discrete, synthetic but hydraulically realistic states under standard conditions via rule validation and post-hoc analysis. With a total of 228 million generated graph-based states, DiTEC-WDN can support a variety of machine-learning tasks, including graph-level, node-level, and link-level regression, as well as time-series forecasting. This contribution, released under a public license, encourages open scientific research in the critical water sector, eliminates the risk of exposing sensitive data, and fulfills the need for a large-scale water distribution network benchmark for study comparisons and scenario analysis.

Paper Structure

This paper contains 12 sections, 11 equations, 6 figures, 2 tables.

Figures (6)

  • Figure 1: Illustration of the dataset generation. The left figure (a) shows a divide-and-conquer PSO optimizing a strategy's configuration. The right figure (b) depicts the usage of the optimized configuration to sample parameter sets and simulate diverse scenarios with unique characteristics (e.g., per-node demand patterns).
  • Figure 2: The folder organization structure. The DiTEC-WDN collection includes 36 networks, each represented as a folder. Every folder contains metadata and seven output parameters, while the number of input parameters varies with the components available in each network. The dataset metadata is written to a Markdown (.md) file structured as a Dataset Card mitchell2019datasetcard. In addition, parameter values are stored in one or more .parquet files, depending on their size. A .parquet file stores indices and node (link) values as distinct columns.
  • Figure 3: Density distribution of pressure and demand across networks in DiTEC-WDN (cyan) and the original ones from input files (orange). The contours denote the data point density of the DiTEC-WDN dataset, with darker blue indicating higher concentration at the center and lighter blue indicating lower density toward the edges. In baseline networks, data points whose pressure is outside the range $(0,151]$ meters are excluded because they reflect impractical operating conditions truong2024gatres.
  • Figure 4: Correlation matrices of generated demands between all scenarios in Hanoi. The left figure (a) shows the correlation between scenarios in the data generated in LeakDB vrachimis2018leakdb. The right figure (b) shows the correlation between the scenarios in our dataset. Both matrices include all 1,000 scenarios, each containing 1 year of demand data. The low correlation between scenarios in our dataset shows the diversity of the data, in contrast to the similarity observed across LeakDB scenarios.
  • Figure 5: Correlation matrices of generated demands between junction nodes in a randomly chosen scenario from Hanoi. The left figure (a) shows the correlation between junction demands in the data generated in LeakDB vrachimis2018leakdb. The right figure (b) shows the correlation between the junction demands in our dataset. The high correlation in LeakDB indicates that the same demand patterns are reused across several nodes, in contrast to what is observed in our dataset. The blocks in the correlation matrix of our dataset highlight the difference between household and commercial demand patterns.
  • ...and 1 more figures
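
The correlation comparison described for Figures 4 and 5 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the toy series below are synthetic stand-ins, not DiTEC-WDN or LeakDB data, and the helper names (`pearson`, `correlation_matrix`, `mean_abs_offdiag`) are hypothetical. The summary statistic (mean absolute off-diagonal correlation) is one possible way to condense a correlation matrix into a single redundancy score; the paper itself presents the full matrices.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation of two equal-length series (0.0 if one is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx > 0 and sy > 0 else 0.0


def correlation_matrix(series_list):
    """Pairwise Pearson correlations between scenarios (or junctions)."""
    k = len(series_list)
    return [[pearson(series_list[i], series_list[j]) for j in range(k)]
            for i in range(k)]


def mean_abs_offdiag(matrix):
    """Average |correlation| over all distinct pairs; lower means more diverse."""
    k = len(matrix)
    vals = [abs(matrix[i][j]) for i in range(k) for j in range(k) if i != j]
    return sum(vals) / len(vals)


# Toy data (assumption): 5 series of 240 hourly values each.
rng = random.Random(0)
shared = [math.sin(2 * math.pi * t / 24) for t in range(240)]
# "Redundant" set: every series reuses one pattern with small noise.
redundant = [[v + rng.gauss(0, 0.05) for v in shared] for _ in range(5)]
# "Diverse" set: every series is drawn independently.
diverse = [[rng.gauss(0, 1) for _ in range(240)] for _ in range(5)]

redundant_score = mean_abs_offdiag(correlation_matrix(redundant))
diverse_score = mean_abs_offdiag(correlation_matrix(diverse))
```

Under these assumptions, the set that reuses one pattern yields a much higher score than the independently drawn set, which mirrors the qualitative contrast the captions describe.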