DUEF-GA: Data Utility and Privacy Evaluation Framework for Graph Anonymization

Jordi Casas-Roma

TL;DR

This paper addresses the lack of standardized evaluation for graph anonymization by introducing DUEF-GA, a framework that combines generic information loss (GIL) metrics with task-specific information loss (SIL) measures, plus re-identification risk assessments. It formalizes an evaluation workflow where original graphs are perturbed to produce anonymized versions, which are then compared across both GIL and SIL to quantify structural, spectral, and task-driven utility losses. The framework supports a pipeline for analyzing multiple anonymization scenarios, including graph modification and differential privacy, and is demonstrated on real-world graphs through three application scenarios, highlighting how to choose parameters to balance privacy and utility. By providing objective scoring for both global graph properties and application-centric tasks such as community detection and information flow, the approach enables fair comparisons across methods and guides practitioners in selecting appropriate anonymization settings. The work contributes a practical, GPL-licensed toolkit and a comprehensive suite of metrics for researchers and practitioners working with privacy-preserving graph data, with implications for safer data sharing and more informed methodological choices.

Abstract

Anonymization of graph-based data has been widely studied in recent years, and several anonymization methods have been developed. Information loss measures have been used to evaluate data utility in the anonymized graphs. However, there is no consensus on how to evaluate data utility and information loss in privacy-preserving and anonymization scenarios, where the anonymous datasets are perturbed to hinder re-identification. Authors use diverse metrics to evaluate data utility and, consequently, it is difficult to compare different methods or algorithms in the literature. In this paper we propose a framework to evaluate and compare anonymous datasets in a common way, providing an objective score to clearly compare methods and algorithms. Our framework includes metrics based on generic information loss measures, such as average distance or betweenness centrality, and also task-specific information loss measures, such as community detection or information flow. Additionally, we provide some metrics to examine re-identification and risk assessment. We demonstrate that our framework can help researchers and practitioners select the best parametrization and/or algorithm to reduce information loss and maximize data utility.

Paper Structure

This paper contains 30 sections, 17 equations, 4 figures, 3 tables.

Figures (4)

  • Figure 1: Experimental framework. The original dataset $G$ is perturbed to produce a sequence of anonymized graphs, i.e., $\widetilde{G}_1, \ldots, \widetilde{G}_p$, using some anonymization method. Next, we compare the original and perturbed data using GIL measures in order to quantify the noise introduced into the data. Then, we do the same with real graph-mining processes and task-specific measures.
  • Figure 2: Framework for evaluating the clustering-specific information loss measure.
  • Figure 3: Examples of our framework results for Scenario II. The horizontal axis shows the anonymization level ($k$-anonymity value), while the vertical axis shows the measured value, starting from the original graph (leftmost point) and evolving as anonymization proceeds.
  • Figure 4: Examples of our framework results for Scenario III. The horizontal axis shows the anonymization level (randomization %), while the vertical axis shows the measured value, starting from the original graph (leftmost point) and evolving as anonymization proceeds.
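
The workflow described in Figure 1 can be illustrated with a minimal Python sketch. It is not the DUEF-GA toolkit: the function names (`make_graph`, `randomize`, `gil_report`), the edge-swapping perturbation, and the use of an absolute difference as the error are all assumptions made for illustration. The sketch perturbs an original graph at several randomization levels and compares one generic information loss measure, average distance, between the original and each perturbed graph.

```python
import random
from collections import deque
from itertools import combinations


def make_graph(edges):
    """Build an undirected graph as a dict mapping node -> set of neighbours."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def edge_list(adj):
    """Return each undirected edge once, as a sorted tuple."""
    return sorted({tuple(sorted((u, v))) for u in adj for v in adj[u]})


def average_distance(adj):
    """Mean shortest-path length over all reachable ordered node pairs (BFS)."""
    total, pairs = 0, 0
    for source in adj:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total += sum(dist.values())
        pairs += len(dist) - 1
    return total / pairs if pairs else 0.0


def randomize(adj, fraction, rng):
    """Assumed perturbation: replace a fraction of edges with random non-edges."""
    edges = edge_list(adj)
    existing = set(edges)
    nodes = sorted(adj)
    non_edges = [e for e in combinations(nodes, 2) if e not in existing]
    k = min(round(fraction * len(edges)), len(non_edges))
    removed = set(rng.sample(edges, k))
    added = rng.sample(non_edges, k)
    perturbed = make_graph([e for e in edges if e not in removed] + added)
    for node in nodes:
        perturbed.setdefault(node, set())  # keep nodes that lost all their edges
    return perturbed


def gil_report(original, levels, seed=0):
    """For each level, return (level, measure on perturbed graph, absolute error)."""
    rng = random.Random(seed)
    baseline = average_distance(original)
    report = []
    for level in levels:
        anonymized = randomize(original, level, rng)
        value = average_distance(anonymized)
        report.append((level, value, abs(value - baseline)))
    return report


# Toy original graph: a ring of 8 nodes.
ring = make_graph([(i, (i + 1) % 8) for i in range(8)])
for level, value, error in gil_report(ring, [0.0, 0.25, 0.5]):
    print(f"randomization={level:.2f}  avg_distance={value:.3f}  error={error:.3f}")
```

Plotting the measure against the randomization level would give a curve of the kind described for Figure 4, with the leftmost point corresponding to the original graph. Other generic measures, or a task-specific comparison such as clustering agreement, could be substituted for `average_distance` under the same loop.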