An Open Source Python Library for Anonymizing Sensitive Data

Judith Sáinz-Pardo Díaz, Álvaro López García

TL;DR

The paper presents anjana, an open-source Python library for anonymizing sensitive tabular data, intended to support open science while complying with data protection regulations. It implements nine anonymization techniques: $k$-anonymity, ($\alpha$,$k$)-anonymity, $\ell$-diversity, entropy $\ell$-diversity, recursive ($c$,$\ell$)-diversity, $t$-closeness, $\delta$-disclosure privacy, and both basic and enhanced $\beta$-likeness. Each technique targets a single sensitive attribute (SA), and the paper describes workflows for handling multiple SAs. The tool emphasizes local data processing and hierarchy-based generalization, and it integrates with the pycanon library to verify the anonymity levels achieved. Its software engineering practices include dependency management with Poetry, CI/CD, unit and functional tests, automated releases to PyPI, and documentation hosted on ReadTheDocs. The library is designed to fit into ML/DL workflows as a reproducible, privacy-preserving data handling step.
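To make the notion of "verifying the achieved anonymity level" concrete, the following is a minimal pure-Python sketch of how $k$ (for $k$-anonymity) and $\ell$ (for distinct $\ell$-diversity) can be measured on an already generalized table. It is not the pycanon or anjana API: the function names `k_anonymity_level` and `distinct_l_diversity_level`, the column names, and the toy records are all hypothetical and chosen only for illustration.

```python
from collections import Counter, defaultdict


def k_anonymity_level(rows, quasi_identifiers):
    """Return k: the size of the smallest equivalence class.

    An equivalence class is the set of records sharing the same values
    for all quasi-identifiers. `rows` must be non-empty.
    """
    classes = Counter(tuple(row[q] for q in quasi_identifiers) for row in rows)
    return min(classes.values())


def distinct_l_diversity_level(rows, quasi_identifiers, sensitive):
    """Return l: the minimum number of distinct sensitive values per class."""
    values = defaultdict(set)
    for row in rows:
        key = tuple(row[q] for q in quasi_identifiers)
        values[key].add(row[sensitive])
    return min(len(v) for v in values.values())


# Hypothetical, already generalized toy table (illustration only).
rows = [
    {"age": "[30, 40)", "zip": "390**", "disease": "flu"},
    {"age": "[30, 40)", "zip": "390**", "disease": "asthma"},
    {"age": "[40, 50)", "zip": "391**", "disease": "flu"},
    {"age": "[40, 50)", "zip": "391**", "disease": "flu"},
    {"age": "[40, 50)", "zip": "391**", "disease": "diabetes"},
]

k = k_anonymity_level(rows, ["age", "zip"])
l = distinct_l_diversity_level(rows, ["age", "zip"], "disease")
print(k, l)  # prints: 2 2
```

In this toy table the two equivalence classes have sizes 2 and 3, so the table is 2-anonymous, and each class contains two distinct diseases, so it also satisfies distinct 2-diversity.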

Abstract

Open science is a fundamental pillar to promote scientific progress and collaboration, based on the principles of open data, open source and open access. However, the requirements for publishing and sharing open data are in many cases difficult to meet in compliance with strict data protection regulations. Consequently, researchers need to rely on proven methods that allow them to anonymize their data without sharing it with third parties. To this end, this paper presents the implementation of a Python library for the anonymization of sensitive tabular data. This framework provides users with a wide range of anonymization methods that can be applied to a given dataset once the user specifies the set of identifiers, the quasi-identifiers, the generalization hierarchies and the allowed level of suppression, along with the sensitive attribute and the level of anonymity required. The library has been implemented following best practices for continuous integration and development, and it uses workflows to test code coverage based on unit and functional tests.
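The inputs listed in the abstract (identifiers, quasi-identifiers, generalization hierarchies, allowed suppression level and required anonymity level) can be illustrated with a small sketch of hierarchy-based generalization for $k$-anonymity. This is an assumption-laden illustration, not the library's implementation: the function `anonymize`, the greedy rule of generalizing the quasi-identifier with the most distinct values, the hierarchy layout (original value mapped to a list of increasingly general values) and the toy data are all hypothetical. Identifier columns are assumed to have been dropped beforehand.

```python
from collections import Counter


def anonymize(rows, quasi_identifiers, hierarchies, k, supp_level):
    """Greedy sketch: generalize until k-anonymity holds after suppression.

    hierarchies[q][value] is a list [level 0 (original), level 1, ...].
    supp_level is the maximum percentage of records that may be removed.
    """
    n = len(rows)
    levels = {q: 0 for q in quasi_identifiers}
    # Highest available level per quasi-identifier (assumes equal depth per value).
    top = {q: len(next(iter(hierarchies[q].values()))) - 1 for q in quasi_identifiers}
    while True:
        # Generalized quasi-identifier tuple for every record.
        keys = [
            tuple(hierarchies[q][row[q]][levels[q]] for q in quasi_identifiers)
            for row in rows
        ]
        sizes = Counter(keys)
        keep = [i for i, key in enumerate(keys) if sizes[key] >= k]
        # Stop if the records in too-small classes fit in the suppression budget.
        if (n - len(keep)) * 100 <= supp_level * n:
            return [
                {**rows[i], **dict(zip(quasi_identifiers, keys[i]))} for i in keep
            ]
        candidates = [q for q in quasi_identifiers if levels[q] < top[q]]
        if not candidates:
            raise ValueError("k cannot be reached with the given hierarchies")

        def distinct(q):
            pos = quasi_identifiers.index(q)
            return len({key[pos] for key in keys})

        # Heuristic (assumption): generalize the most diverse quasi-identifier.
        levels[max(candidates, key=distinct)] += 1


# Hypothetical toy data and hierarchies (illustration only).
hierarchies = {
    "age": {
        "31": ["31", "[30, 40)", "*"], "35": ["35", "[30, 40)", "*"],
        "42": ["42", "[40, 50)", "*"], "47": ["47", "[40, 50)", "*"],
        "68": ["68", "[60, 70)", "*"],
    },
    "zip": {
        "39001": ["39001", "3900*", "*"], "39005": ["39005", "3900*", "*"],
        "39012": ["39012", "3901*", "*"], "28001": ["28001", "2800*", "*"],
    },
}
rows = [
    {"age": "31", "zip": "39001", "disease": "flu"},
    {"age": "35", "zip": "39005", "disease": "asthma"},
    {"age": "42", "zip": "39012", "disease": "flu"},
    {"age": "47", "zip": "39012", "disease": "diabetes"},
    {"age": "68", "zip": "28001", "disease": "flu"},
]

anonymized = anonymize(rows, ["age", "zip"], hierarchies, k=2, supp_level=20)
for record in anonymized:
    print(record)
```

With `k=2` and a 20% suppression budget, the sketch generalizes both columns by one level and suppresses the single outlying record, leaving four records in two equivalence classes of size 2. Allowing some suppression avoids generalizing every record further just to accommodate one outlier.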

Paper Structure (6 sections, 2 figures, 6 tables)

This paper contains 6 sections, 2 figures, 6 tables.

Figures (2)

  • Figure 1: Workflow: select the anonymity technique to be applied depending on the privacy objective.
  • Figure 2: Classic workflow of a data-based AI project including the training/validation and testing phase of ML/DL models and the data anonymization process.

Theorems & Definitions (3)

  • Definition 1
  • Definition 2
  • Definition 3