Privacy-Preserving Optimal Parameter Selection for Collaborative Clustering
Maryam Ghasemian, Erman Ayday
TL;DR
This work addresses privacy-preserving parameter selection for collaborative clustering by introducing a semi-trusted server that recommends the optimal clustering algorithm and hyperparameters based on data perturbed with randomized response under local differential privacy. The approach uses the Adjusted Rand Index (ARI), Silhouette Score, Calinski-Harabasz index (CH), and accuracy to evaluate clustering quality, and it quantifies privacy risk, notably the risk of membership inference as the privacy budget $\epsilon$ increases. Experiments on the Obesity and Extended Iris datasets show that server recommendations are largely stable across perturbation levels and data-sharing amounts, with K-Means often favored and elbow-based parameter selection improving robustness. However, increasing $\epsilon$ raises membership inference risk, underscoring a trade-off between privacy and utility and highlighting the need for careful privacy-preserving design in server-assisted collaborative clustering.
Abstract
This study investigates the optimal selection of parameters for collaborative clustering while ensuring data privacy. We focus on key clustering algorithms within a collaborative framework, where multiple data owners combine their data. A semi-trusted server assists in recommending the most suitable clustering algorithm and its parameters. Our findings indicate that the privacy parameter ($\epsilon$) has minimal impact on the server's recommendations, but that increasing $\epsilon$ raises the risk of membership inference attacks, in which sensitive information might be inferred. To mitigate these risks, we apply differential privacy techniques, in particular the Randomized Response mechanism, which adds noise to the shared data. Our approach demonstrates that high-quality clustering can be achieved while maintaining data confidentiality, as evidenced by metrics such as the Adjusted Rand Index and Silhouette Score. This study contributes to privacy-aware data sharing, optimal algorithm and parameter selection, and effective communication between data owners and the server.
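As an illustration of the Randomized Response mechanism mentioned above, the following is a minimal sketch of one common variant, k-ary randomized response, applied to a single categorical attribute. The abstract does not specify which variant the authors use, so the keep probability $e^{\epsilon} / (e^{\epsilon} + k - 1)$, the function names, and the example domain and records are all assumptions made for illustration.

```python
import math
import random


def keep_probability(epsilon: float, k: int) -> float:
    """Probability of reporting the true category among k categories.

    Assumed variant: k-ary randomized response, which keeps the true
    value with probability e^eps / (e^eps + k - 1).
    """
    return math.exp(epsilon) / (math.exp(epsilon) + k - 1)


def randomized_response(value, domain, epsilon, rng):
    """Perturb one categorical value locally, before it is shared."""
    p = keep_probability(epsilon, len(domain))
    if rng.random() < p:
        return value  # report the true category
    # Otherwise report one of the other categories, chosen uniformly.
    others = [v for v in domain if v != value]
    return rng.choice(others)


# Hypothetical attribute domain and records, for illustration only.
rng = random.Random(0)
domain = ["low", "medium", "high"]
records = ["low", "high", "medium", "low"]
perturbed = [randomized_response(r, domain, epsilon=1.0, rng=rng) for r in records]
```

Under this variant, $\epsilon = 0$ makes every category equally likely to be reported, while larger $\epsilon$ pushes the keep probability toward 1. The perturbed data then stays closer to the original, which is consistent with the higher membership inference risk reported for larger $\epsilon$.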
