Augmenting Anonymized Data with AI: Exploring the Feasibility and Limitations of Large Language Models in Data Enrichment

Stefano Cirillo, Domenico Desiato, Giuseppe Polese, Monica Maria Lucia Sebillo, Giandomenico Solimando

TL;DR

This work addresses the privacy-utility conflict in sharing and analyzing data by exploring whether Large Language Models (LLMs) can augment anonymized datasets without compromising confidentiality. It formalizes the augmentation problem under $k$-anonymity and introduces a multimodal prompt engineering framework that guides LLMs in generating synthetic records that are anonymized yet useful. The authors evaluate ChatGPT and two Claude-3 variants (Sonnet and Haiku) on two real-world datasets (Adult and Italia) anonymized with three algorithms (Mondrian, Top-Down Greedy, and Clustering-Based $k$-Anonymisation), and quantify anonymity with the pyCanon library. Key findings indicate that the balance between data utility and privacy depends on both the dataset characteristics and the chosen anonymization method, with Clustering-Based Anonymisation (CBA) generally offering the best foundation for coherent augmentation. Results vary as the target $k$ changes, and larger or more diverse datasets improve adherence to $k$-anonymity. The work demonstrates the feasibility of LLM-assisted enrichment of anonymized data and points to future directions such as incorporating differential privacy and broadening domain coverage for more robust, privacy-preserving analytics.

Abstract

Large Language Models (LLMs) have demonstrated advanced capabilities in both text generation and comprehension, and their application to data archives might help protect sensitive information about data subjects. Data often contains sensitive and personally identifiable details which, if not safeguarded, pose privacy risks in terms of both disclosure and identification. Furthermore, applying anonymization techniques such as $k$-anonymity can significantly reduce the amount of data available within data sources, which may reduce the efficacy of predictive processes. In our study, we investigate the capabilities offered by LLMs to enrich anonymized data sources without affecting their anonymity. To this end, we designed new ad-hoc prompt template engineering strategies to perform anonymized Data Augmentation and to assess the effectiveness of LLM-based approaches in providing anonymized data. To validate the anonymization guarantees provided by LLMs, we used the pyCanon library, which is designed to assess the values of the parameters associated with the most common anonymization-based privacy-preserving techniques. Our experiments on real-world datasets demonstrate that LLMs yield promising results for this goal.
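
To make the validation step concrete, the following is a minimal sketch of what checking $k$-anonymity after augmentation could look like. It is not the authors' pipeline and does not use pyCanon; it relies only on the standard definition of $k$-anonymity (every combination of quasi-identifier values is shared by at least $k$ records). The function names, the quasi-identifier columns, and the toy records are all hypothetical and chosen purely for illustration.

```python
from collections import Counter


def k_anonymity(records, quasi_identifiers):
    """Return the size of the smallest equivalence class, i.e. the largest k
    for which the records are k-anonymous over the given quasi-identifiers."""
    if not records:
        return 0
    # Group records by their tuple of quasi-identifier values.
    classes = Counter(
        tuple(record[qi] for qi in quasi_identifiers) for record in records
    )
    return min(classes.values())


def augmentation_preserves_k(anonymized, synthetic, quasi_identifiers, k):
    """Check that adding the synthetic records keeps the data k-anonymous."""
    return k_anonymity(anonymized + synthetic, quasi_identifiers) >= k


# Hypothetical toy data: generalized age and ZIP code act as quasi-identifiers.
QI = ["age", "zip"]
anonymized = [
    {"age": "30-39", "zip": "841**", "income": "<=50K"},
    {"age": "30-39", "zip": "841**", "income": ">50K"},
    {"age": "40-49", "zip": "840**", "income": "<=50K"},
    {"age": "40-49", "zip": "840**", "income": "<=50K"},
]

# A synthetic record that falls into an existing equivalence class.
consistent_record = [{"age": "30-39", "zip": "841**", "income": "<=50K"}]
# A synthetic record that creates a new class of size 1.
isolated_record = [{"age": "50-59", "zip": "842**", "income": ">50K"}]

k_before = k_anonymity(anonymized, QI)
keeps_k = augmentation_preserves_k(anonymized, consistent_record, QI, k=2)
breaks_k = augmentation_preserves_k(anonymized, isolated_record, QI, k=2)
```

In this toy case, a generated record that reuses an existing combination of generalized values leaves the smallest equivalence class unchanged, while a record with a previously unseen combination forms a class of size one and lowers the achieved $k$. This is the kind of violation an anonymity check on augmented data is meant to catch.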

Paper Structure

This paper contains 16 sections, 3 tables.