Differentially Private Release of Hierarchical Origin/Destination Data with a TopDown Approach

Fabrizio Boninsegna, Francesco Silvestri

TL;DR

The paper tackles private release of hierarchical origin-destination data under bounded differential privacy, introducing InfTDA, a TopDown mechanism that uses Chebyshev distance and an integer optimizer IntOpt to enforce non-negativity and hierarchical consistency. It provides a theoretical bound on the maximum absolute error and demonstrates reduced false positives while maintaining hierarchical accuracy, validated on real ISTAT O/D data and synthetic datasets. The approach generalizes TopDown to non-negative hierarchical trees and offers a practical, faster alternative to existing TDA variants, with broad applicability to other tabular hierarchies. Overall, it delivers high-utility DP O/D datasets that remain coherent across geographic scales, enabling reliable downstream marginal queries and decision-making.

Abstract

This paper presents a novel method for generating differentially private tabular datasets for hierarchical data, focusing on origin-destination (O/D) trips. The approach builds on the TopDown algorithm, a constraint-based mechanism developed by the U.S. Census Bureau to incorporate invariant queries into tabular data. Hierarchical O/D data are datasets representing trips between geographical areas organized in a hierarchy (e.g., region $\rightarrow$ province $\rightarrow$ city). The proposed method is designed to improve the accuracy of queries covering broader geographical areas, which are obtained by aggregation. This provides a "zoom-in" effect on the dataset while ensuring that, when zooming back out, the overall picture is preserved. The approach also aims to reduce the detection of false positives. These characteristics can strengthen practitioners' and decision-makers' confidence in adopting differentially private datasets. The main technical contributions are a novel TopDown algorithm that employs constrained optimization with Chebyshev distance minimization, with theoretical guarantees on the maximum absolute error, and a new integer optimization algorithm that significantly reduces the incidence of false positives. The effectiveness of the proposed approach is validated on real-world and synthetic O/D datasets, demonstrating its ability to generate private data with high utility and fewer false positives. Our experiments focus on O/D datasets with a single trip as the unit of privacy; nevertheless, the proposed approach supports other units of privacy and applies to any tabular data with a hierarchical structure.
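To make the idea of Chebyshev distance minimization under non-negativity and consistency constraints concrete, the following is a minimal sketch, not the paper's IntOpt. It assumes integer noisy child counts (as integer-valued noise would produce) and an already fixed parent total; the function name `chebyshev_fit` and the greedy tie-breaking are illustrative assumptions.

```python
def chebyshev_fit(noisy, total):
    """Return (x, t): non-negative integers x summing to `total` that
    minimize t = max_i |x_i - noisy_i|.

    Illustrative only; inputs are assumed to be integers.
    """
    if not noisy or total < 0:
        raise ValueError("need at least one child and a non-negative total")

    def bounds(t):
        # With radius t, each x_i must lie in [max(0, y_i - t), y_i + t].
        lo = [max(0, y - t) for y in noisy]
        hi = [y + t for y in noisy]
        return lo, hi

    def feasible(t):
        lo, hi = bounds(t)
        # Every interval must be non-empty and the total must be reachable.
        return min(hi) >= 0 and sum(lo) <= total <= sum(hi)

    # Feasibility is monotone in t, so binary search finds the smallest radius.
    low, high = 0, max(abs(y) for y in noisy) + total
    while low < high:
        mid = (low + high) // 2
        if feasible(mid):
            high = mid
        else:
            low = mid + 1

    # Start from the lower bounds and greedily hand out what is still missing.
    lo, hi = bounds(low)
    result = list(lo)
    remaining = total - sum(lo)
    for i in range(len(result)):
        step = min(hi[i] - lo[i], remaining)
        result[i] += step
        remaining -= step
    return result, low


# Hypothetical noisy child counts whose parent was released as 4.
fitted, radius = chebyshev_fit([5, -2, 3], 4)
```

Here the negative noisy count is pushed up to zero, the children sum to the parent, and no child moves by more than `radius`. Optimal solutions are generally not unique; this sketch returns one of them.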

Paper Structure

This paper contains 39 sections, 15 theorems, 21 equations, 6 figures, 1 table, 4 algorithms.

Key Result

Theorem 1

Given an O/D dataset with $g$ geographic levels, InfTDA returns, with constant probability, a differentially private tabular dataset whose maximum absolute error is at most $\tilde{O}(\sqrt{\ell^3 g})$ for O/D flows with origin and destination at level $\ell \in \{0,\dots, g\}$.
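As a rough reading of how this bound scales, ignoring the polylogarithmic factors hidden by $\tilde{O}$ (so the expressions below describe growth rates, not actual error values): at the finest level $\ell = g$ the bound is quadratic in the number of levels, and doubling the level multiplies it by a constant factor.

$$
\ell = g:\quad \tilde{O}\!\left(\sqrt{g^{3}\cdot g}\right) = \tilde{O}\!\left(g^{2}\right),
\qquad
\frac{\sqrt{(2\ell)^{3}\, g}}{\sqrt{\ell^{3}\, g}} = 2^{3/2} \approx 2.83 .
$$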

Figures (6)

  • Figure 1: The first three hierarchical levels of Italy, according to ISTAT.
  • Figure 2: Example of the two-step construction of the destination tree, shown from top to bottom on the left of the figure. Figure (a) shows two areas at level $\ell$, $u_\ell$ and $v_{\ell}$, and an arrow with attribute $q(u_\ell, v_\ell)$ indicating the flow between them. Figure (b) depicts the first step: the destination area $v_\ell$ is divided into its child areas $v_{\ell+1, 0}$ and $v_{\ell+1,1}$ (this example uses a bipartition), and the arrows indicate the cross-level range query of order one. Figure (c) depicts the last step: the origin area is divided as well, and the arrows indicate the intra-level query of the finer geographic level $\ell+1$. Figure (d) depicts the destination tree. The links ensure hierarchical consistency: the value of a node equals the sum of the values of its children.
  • Figure 3: Experiments run for the Italian dataset (from ISTAT). From left to right: maximum absolute error, false discovery rate, and execution time. The error bars indicate maximum and minimum values over 10 experiments.
  • Figure 4: Experiments run for the synthetic datasets with a focus on maximum absolute error. The error bars indicate maximum and minimum values over 10 experiments.
  • Figure 5: False discovery rate for synthetic datasets.
  • ...and 1 more figure
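The hierarchical consistency described in the Figure 2 caption (a node's value equals the sum of its children's values) can be illustrated with a small aggregation sketch. The area names, the `parent` mapping, and the function `aggregate_flows` are hypothetical and only serve the example; they are not taken from the paper or from the ISTAT data.

```python
from collections import defaultdict


def aggregate_flows(flows, parent):
    """Sum flows between fine-level areas into flows between their parents."""
    coarse = defaultdict(int)
    for (origin, destination), count in flows.items():
        coarse[(parent[origin], parent[destination])] += count
    return dict(coarse)


# Hypothetical two-level hierarchy: three cities in two provinces.
parent = {"city_a": "prov_x", "city_b": "prov_x", "city_c": "prov_y"}

# Hypothetical city-level O/D counts, keyed by (origin, destination).
city_flows = {
    ("city_a", "city_b"): 3,
    ("city_a", "city_c"): 2,
    ("city_b", "city_c"): 4,
    ("city_c", "city_a"): 1,
}

# "Zooming out": province-level flows are obtained purely by aggregation,
# so every coarse value is the sum of the fine values beneath it.
province_flows = aggregate_flows(city_flows, parent)
```

A released dataset is consistent in this sense when the coarse-level table it publishes matches the aggregation of its fine-level table, as `province_flows` does here by construction.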

Theorems & Definitions (19)

  • Theorem 1: Informal version of utility of InfTDA
  • Definition 1: Differential Privacy (DP) [dwork2014algorithmic]
  • Definition 2: zero-Concentrated Differential Privacy (zCDP) [bun2016concentrated]
  • Lemma 1: From $\rho$-zCDP to $(\varepsilon, \delta)$-DP (Lemma 21 in [bun2016concentrated])
  • Lemma 2: Post-Process Immunity (Lemma 8 in [bun2016concentrated])
  • Lemma 3: Composition (from Lemma 7 in [bun2016concentrated])
  • Theorem 2: Discrete Gaussian Mechanism [canonne2020discrete]
  • Corollary 1: Corollary 9 in [canonne2020discrete]
  • Theorem 3: SH-Stability-Based Histogram [bun2019simultaneous]
  • Definition 3: Non-Negative Hierarchical Tree
  • ...and 9 more
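Several of the listed results concern zCDP accounting. The sketch below shows how such accounting is commonly done, using standard facts from the zCDP literature: a Gaussian-type mechanism with variance $\sigma^2$ on a query of $L_2$ sensitivity $\Delta$ satisfies $\rho$-zCDP with $\rho = \Delta^2 / (2\sigma^2)$, budgets add up under composition, and $\rho$-zCDP implies $(\rho + 2\sqrt{\rho \ln(1/\delta)}, \delta)$-DP. The exact statements in the paper may differ. The even split across levels and all function names are assumptions made for illustration, not the paper's allocation.

```python
import math


def zcdp_to_dp_epsilon(rho, delta):
    """Epsilon of the (epsilon, delta)-DP guarantee implied by rho-zCDP."""
    if rho < 0 or not 0 < delta < 1:
        raise ValueError("need rho >= 0 and 0 < delta < 1")
    return rho + 2.0 * math.sqrt(rho * math.log(1.0 / delta))


def gaussian_variance(rho, sensitivity=1.0):
    """Noise variance giving rho-zCDP for a query with this L2 sensitivity."""
    if rho <= 0:
        raise ValueError("need rho > 0")
    return sensitivity ** 2 / (2.0 * rho)


def split_budget(total_rho, num_levels):
    """Assumed even split of the zCDP budget; composition adds the parts."""
    return [total_rho / num_levels] * num_levels


# Hypothetical setting: total budget 1.0 spread over 4 geographic levels.
per_level = split_budget(1.0, 4)
variances = [gaussian_variance(rho) for rho in per_level]
epsilon = zcdp_to_dp_epsilon(sum(per_level), delta=1e-6)
```

The `sensitivity` parameter is left to the caller because it depends on the unit of privacy and on which queries are released together.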