Privacy-Optimized Randomized Response for Sharing Multi-Attribute Data

Akito Yamamoto, Tetsuo Shibuya

TL;DR

This work tackles the challenge of privately sharing multi-attribute categorical data under differential privacy. It introduces a privacy-optimized randomized response mechanism and a scalable inductive heuristic that together provide stronger dataset-wide privacy than existing Kronecker-product methods, with a time complexity of $O(k^2)$ suitable for large attribute sets. The core contributions include an exact linear programming formulation for small to moderate $k$, a near-optimal inductive method for large $k$, and empirical evidence demonstrating reduced output error in genome-statistics analyses compared to prior approaches. The methods are shown to be practical in runtime (e.g., $k=1000$) and yield meaningful gains in both privacy guarantees and analytical accuracy, making them relevant for trustworthy sharing and analysis of high-dimensional multi-attribute data, including genomic and healthcare datasets.

Abstract

With the increasing amount of data in society, privacy concerns in data sharing have become widely recognized. Particularly, protecting personal attribute information is essential for a wide range of aims from crowdsourcing to realizing personalized medicine. Although various differentially private methods based on randomized response have been proposed for single attribute information or specific analysis purposes such as frequency estimation, there is a lack of studies on the mechanism for sharing individuals' multiple categorical information itself. The existing randomized response for sharing multi-attribute data uses the Kronecker product to perturb each attribute information in turn according to the respective privacy level but achieves only a weak privacy level for the entire dataset. Therefore, in this study, we propose a privacy-optimized randomized response that guarantees the strongest privacy in sharing multi-attribute data. Furthermore, we present an efficient heuristic algorithm for constructing a near-optimal mechanism. The time complexity of our algorithm is O(k^2), where k is the number of attributes, and it can be performed in about 1 second even for large datasets with k = 1,000. The experimental results demonstrate that both of our methods provide significantly stronger privacy guarantees for the entire dataset than the existing method. In addition, we show an analysis example using genome statistics to confirm that our methods can achieve less than half the output error compared with that of the existing method. Overall, this study is an important step toward trustworthy sharing and analysis of multi-attribute data. The Python implementation of our experiments and supplemental results are available at https://github.com/ay0408/Optimized-RR.
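
As a rough illustration of the Kronecker-product baseline described in the abstract, the sketch below builds one randomized-response matrix per attribute and combines them with a Kronecker product. It assumes each attribute is perturbed with standard k-ary randomized response, which the abstract does not specify. The function names (`krr_matrix`, `achieved_epsilon`) and the example attribute sizes and privacy levels are hypothetical and are not taken from the authors' repository. Under these assumptions, the privacy level of the joint mechanism comes out equal to $\sum_{i=1}^k \epsilon_i$, which is one way to read the "weak privacy level for the entire dataset" that the proposed methods aim to improve on.

```python
from functools import reduce

import numpy as np


def krr_matrix(num_values, epsilon):
    """Standard k-ary randomized response (an assumption, not the paper's exact matrix).

    P[j, i] is the probability of reporting value j when the true value is i.
    """
    e = np.exp(epsilon)
    off_diagonal = 1.0 / (e + num_values - 1)
    P = np.full((num_values, num_values), off_diagonal)
    np.fill_diagonal(P, e * off_diagonal)
    return P


def achieved_epsilon(P):
    """Largest log-ratio between any two input columns for the same output row."""
    return float(np.max(np.log(P.max(axis=1) / P.min(axis=1))))


# Hypothetical example: three attributes with 2, 3, and 3 possible values.
sizes = [2, 3, 3]
epsilons = [0.5, 1.0, 1.5]

# Joint mechanism over all 2 * 3 * 3 = 18 attribute combinations.
joint = reduce(np.kron, [krr_matrix(a, eps) for a, eps in zip(sizes, epsilons)])

print(joint.shape)
print(achieved_epsilon(joint), sum(epsilons))
```

The two printed privacy values should agree up to floating-point error, since the largest and smallest entries in each row of the joint matrix are products of the per-attribute diagonal and off-diagonal entries, respectively.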

Paper Structure (20 sections, 58 equations, 5 figures, 1 table, 3 algorithms)

Figures (5)

  • Figure 1: Achieved privacy levels for the entire dataset when (a) $k = 3$, (b) $k = 5$, (c) $k = 7$, and (d) $k = 10$. The $x$-axis represents $\sum_{i=1}^k \epsilon_i$; that is, the sum of the per-attribute privacy levels. The $y$-axis represents the ratio of the achieved privacy level to $\sum_{i=1}^k \epsilon_i$. The plots compare the existing Kronecker product-based method (black, solid), the optimal mechanism (red, dash-dot), and our heuristic method (blue, dashed).
  • Figure 2: Optimality of our heuristic method when (a) $k = 3$, (b) $k = 5$, (c) $k = 7$, and (d) $k = 10$. The $x$-axis represents $\sum_{i=1}^k \epsilon_i$; that is, the sum of the per-attribute privacy levels. The $y$-axis represents the ratio of the achieved privacy level to the optimal solution.
  • Figure 3: Achieved privacy levels for the entire dataset when (a) $k = 3$, (b) $k = 5$, (c) $k = 7$, and (d) $k = 10$. The $x$-axis represents $\sum_{i=1}^k a_i$; that is, the number of possible values summed over all attributes. The $y$-axis represents the ratio of the achieved privacy level to $\sum_{i=1}^k \epsilon_i$. The plots compare the existing Kronecker product-based method (black, solid), the optimal mechanism (red, dash-dot), and our heuristic method (blue, dashed).
  • Figure 4: Optimality of our heuristic method when (a) $k = 3$, (b) $k = 5$, (c) $k = 7$, and (d) $k = 10$. The $x$-axis represents $\sum_{i=1}^k a_i$; that is, the number of possible values summed over all attributes. The $y$-axis represents the ratio of the achieved privacy level to the optimal solution.
  • Figure 5: Comparison of the accuracy of $\chi^2$-statistics between the existing Kronecker product-based method (black, solid) and our heuristic method (blue, dashed) when the privacy level for the entire dataset is fixed. The $x$-axis represents the number of SNPs; that is, $k$. The $y$-axis represents the average difference between the original and differentially private statistics. The error bars represent the range of all results.

Theorems & Definitions (4)

  • Definition 1
  • proof
  • proof
  • proof