Privacy-Preserving Optimal Parameter Selection for Collaborative Clustering

Maryam Ghasemian, Erman Ayday

TL;DR

This work addresses privacy-preserving parameter selection for collaborative clustering by introducing a semi-trusted server that recommends the optimal clustering algorithm and hyperparameters based on data perturbed with randomized response under local differential privacy. The approach uses the Adjusted Rand Index (ARI), Silhouette Score, Calinski-Harabasz Index (CH), and accuracy to evaluate clustering quality, and it quantifies privacy risk, notably how membership inference risk grows as the privacy budget $\epsilon$ increases. Experiments on the Obesity and Extended Iris datasets show that server recommendations are largely stable across perturbation levels and data-sharing amounts, with K-Means often favored and elbow-based parameter selection improving robustness. However, increasing $\epsilon$ raises membership inference risk, underscoring a trade-off between privacy and utility and highlighting the need for careful privacy-preserving design in server-assisted collaborative clustering.

Abstract

This study investigates the optimal selection of parameters for collaborative clustering while ensuring data privacy. We focus on key clustering algorithms within a collaborative framework, where multiple data owners combine their data. A semi-trusted server assists in recommending the most suitable clustering algorithm and its parameters. Our findings indicate that the privacy parameter ($\epsilon$) minimally impacts the server's recommendations, but an increase in $\epsilon$ raises the risk of membership inference attacks, where sensitive information might be inferred. To mitigate these risks, we implement differential privacy techniques, particularly the Randomized Response mechanism, to add noise and protect data privacy. Our approach demonstrates that high-quality clustering can be achieved while maintaining data confidentiality, as evidenced by metrics such as the Adjusted Rand Index and Silhouette Score. This study contributes to privacy-aware data sharing, optimal algorithm and parameter selection, and effective communication between data owners and the server.
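The Adjusted Rand Index named above compares a predicted clustering with reference labels while correcting for chance agreement. The following is a minimal standard-library sketch of the usual pair-counting formula, not the authors' evaluation code; the function name `adjusted_rand_index` and the toy labels are illustrative assumptions.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand Index between two labelings of the same points.

    Uses the standard pair-counting form:
    ARI = (index - expected_index) / (max_index - expected_index).
    """
    if len(labels_true) != len(labels_pred):
        raise ValueError("both labelings must cover the same points")

    total_pairs = comb(len(labels_true), 2)
    if total_pairs == 0:
        return 1.0  # fewer than two points: treat labelings as identical

    # Pairs of points that share a cell of the contingency table.
    cells = Counter(zip(labels_true, labels_pred))
    index = sum(comb(c, 2) for c in cells.values())

    # Pairs of points that share a cluster in each labeling separately.
    same_true = sum(comb(c, 2) for c in Counter(labels_true).values())
    same_pred = sum(comb(c, 2) for c in Counter(labels_pred).values())

    expected = same_true * same_pred / total_pairs
    max_index = (same_true + same_pred) / 2
    if max_index == expected:
        return 1.0  # degenerate case, e.g. both labelings are one cluster
    return (index - expected) / (max_index - expected)


if __name__ == "__main__":
    truth = [0, 0, 1, 1]
    # Same partition with swapped label names: ARI ignores label names.
    print(adjusted_rand_index(truth, [1, 1, 0, 0]))
    # A partition that cuts across the true clusters.
    print(adjusted_rand_index(truth, [0, 1, 0, 1]))
```

ARI is invariant to how cluster labels are named, which makes it suitable for comparing clusterings produced on original and perturbed data.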

Paper Structure (19 sections, 4 figures, 7 tables)

Figures (4)

  • Figure 1: Comprehensive five-step process, highlighting the interaction between multiple data owners and the server. We show how data are shared, processed for noise addition (to achieve differential privacy), and then utilized in a collaborative clustering algorithm, all while maintaining strict privacy protocols. In step (1), data owners add noise to part of their datasets using randomized response (RR). In step (2), data owners send a portion of their noisy data to the server. In step (3), the server applies various methods to find the optimal algorithm and its corresponding hyperparameter(s). In step (4), the server returns its outcome (algorithm and parameters) to the data owners. Finally, in step (5), the data owners perform collaborative clustering based on the server's suggestions.
  • Figure 2: Visual Representation of Clustering Algorithm Performance Across Combined Datasets. This figure illustrates the performance metrics from Table \ref{tab:server_suggestion_10_0.1} for various clustering algorithms (GMM, DBSCAN, K-Means, and Hierarchical Clustering, HC) evaluated under conditions of 10% data sharing and a privacy parameter of $\epsilon = 0.1$. Performance metrics including Adjusted Rand Index (ARI), Homogeneity (Homo), Completeness (Comp), Silhouette Score, Calinski-Harabasz Index (CH), and Accuracy are plotted. Algorithms recommended by the server are highlighted with dots, showcasing their superior performance in comparison to others in each dataset scenario.
  • Figure 3: (a) Contrast in dataset #1 with Overlapping Clusters ($\epsilon$ = 1, 5, 10): This part displays the differences between original ('O') and noise-modified ('X') data in closely positioned clusters, colored blue and red. (b) Comparison in dataset #1 with Clear Cluster Gaps ($\epsilon$ = 1, 5, 10): Here, the focus is on the impact of the Randomized Response (RR) method on data (original 'O', noisy 'X') in maintaining cluster gaps despite noise variations, balancing privacy with data structure integrity. (c) Original vs. Noisy Data in dataset #2 ($\epsilon$ = 1, 5, 10): This section compares original ('O') and noise-affected ('X') data at different privacy levels, using blue, red, and green to show cluster separation effectiveness via the RR mechanism. Note: Plots can be zoomed in for clearer visualization.
  • Figure 4: Analysis of Membership Inference Attack Risks: This figure illustrates the increasing likelihood of data identification in two datasets as the privacy parameter ($\epsilon$) increases. The blue bars represent dataset #1, and the green bars represent dataset #2, highlighting the direct correlation between reduced noise levels and heightened data vulnerability.
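
The link between a larger $\epsilon$ and reduced noise described in Figures 3 and 4 can be illustrated with k-ary (generalized) randomized response on a single categorical attribute. This is a sketch under assumptions, not the authors' implementation: the summary does not specify which RR variant is used or how numerical features are handled, and the names `keep_probability`, `randomized_response`, and the three-level `domain` are hypothetical. In this variant the true category is reported with probability $e^{\epsilon} / (e^{\epsilon} + k - 1)$, which approaches 1 as $\epsilon$ grows.

```python
import math
import random


def keep_probability(epsilon: float, k: int) -> float:
    """Probability that k-ary randomized response reports the true category."""
    return math.exp(epsilon) / (math.exp(epsilon) + k - 1)


def randomized_response(value, domain, epsilon, rng=random):
    """Perturb one categorical value (assumed k-ary RR variant).

    The true value is kept with probability keep_probability(epsilon, k);
    otherwise one of the other k - 1 categories is chosen uniformly.
    """
    if rng.random() < keep_probability(epsilon, len(domain)):
        return value
    others = [v for v in domain if v != value]
    return rng.choice(others)


if __name__ == "__main__":
    domain = ["low", "medium", "high"]  # hypothetical attribute levels
    rng = random.Random(0)              # fixed seed for repeatability
    for eps in (0.1, 1.0, 5.0, 10.0):
        noisy = [randomized_response("low", domain, eps, rng)
                 for _ in range(10_000)]
        kept = noisy.count("low") / len(noisy)
        print(f"epsilon={eps:>4}: theoretical keep={keep_probability(eps, 3):.3f}, "
              f"empirical keep={kept:.3f}")
```

With a small $\epsilon$ the reported value is close to a uniform draw over the domain, while a large $\epsilon$ leaves most records unchanged, which is consistent with the higher membership inference risk the figure reports.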