ClustEm4Ano: Clustering Text Embeddings of Nominal Textual Attributes for Microdata Anonymization

Robert Aufschläger, Sebastian Wilhelm, Michael Heigl, Martin Schramm

TL;DR

This work addresses the challenge of anonymizing tabular data with nominal textual quasi-identifiers by reducing reliance on manually crafted value generalization hierarchies (VGHs). It introduces ClustEm4Ano, a pipeline that automatically generates multi-level VGHs by clustering text embeddings (across 13 models) and integrates them into $k$-anonymity and related privacy constraints via FLASH-based generalization implemented through ARX. Experimental results on the Adult dataset show that embedding-driven VGHs can achieve better downstream ML efficacy and data utility than manually constructed baselines, particularly for small $k$ ($2\le k \le 30$), while maintaining reasonable record retention. The approach offers a scalable, semantics-aware alternative for privacy-preserving microdata publishing, with public code and potential applicability across domains, though clustering quality, other privacy models, and domain-specific embeddings remain open for further exploration.

Abstract

This work introduces ClustEm4Ano, an anonymization pipeline for generalization- and suppression-based anonymization of nominal textual tabular data. It automatically generates value generalization hierarchies (VGHs) that, in turn, can be used to generalize attributes in quasi-identifiers. The pipeline leverages embeddings to generate semantically close value generalizations through iterative clustering. We applied KMeans and Hierarchical Agglomerative Clustering to $13$ different predefined text embeddings (both open- and closed-source, via APIs). Our approach is experimentally tested on a well-known benchmark dataset for anonymization: the UCI Machine Learning Repository's Adult dataset. ClustEm4Ano supports anonymization procedures by offering more options than arbitrarily chosen VGHs. Experiments demonstrate that these VGHs can outperform manually constructed ones in terms of downstream efficacy, especially for small $k$ in $k$-anonymity ($2 \leq k \leq 30$), and can therefore improve the quality of anonymized datasets. Our implementation is publicly available.
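
To make the idea of building a VGH by iteratively clustering embeddings more concrete, here is a minimal sketch. It is not the authors' implementation: the 2-D vectors are hand-picked stand-ins for real text embeddings, the function names `average_linkage` and `build_vgh` are hypothetical, and the clustering is a small hand-written average-linkage agglomerative procedure rather than a library call. Each requested cluster count yields one generalization level, and every hierarchy ends in the fully suppressed value `*`.

```python
import math

# Toy 2-D "embeddings" (hypothetical values chosen for illustration only;
# a real run would use vectors returned by a text embedding model).
EMBEDDINGS = {
    "9th": (0.10, 0.20),
    "10th": (0.15, 0.25),
    "HS-grad": (0.40, 0.30),
    "Bachelors": (0.90, 0.80),
    "Masters": (1.00, 0.90),
    "Doctorate": (1.10, 1.00),
}


def average_linkage(c1, c2, emb):
    """Mean pairwise Euclidean distance between two clusters of values."""
    total = sum(math.dist(emb[a], emb[b]) for a in c1 for b in c2)
    return total / (len(c1) * len(c2))


def build_vgh(emb, cluster_counts):
    """Map each value to its generalization path [value, level 1, ..., '*'].

    Clusters are merged bottom-up, and a level is recorded each time the
    number of clusters reaches one of the requested counts, so coarser
    levels always nest the finer ones.
    """
    clusters = [[value] for value in sorted(emb)]
    paths = {value: [value] for value in emb}
    for target in sorted(cluster_counts, reverse=True):
        while len(clusters) > target:
            # Merge the pair of clusters with the smallest average distance.
            pairs = (
                (i, j)
                for i in range(len(clusters))
                for j in range(i + 1, len(clusters))
            )
            i, j = min(
                pairs,
                key=lambda p: average_linkage(clusters[p[0]], clusters[p[1]], emb),
            )
            clusters[i] = clusters[i] + clusters[j]
            del clusters[j]
        # Label every cluster by its sorted members and record the level.
        for cluster in clusters:
            label = "{" + ", ".join(sorted(cluster)) + "}"
            for value in cluster:
                paths[value].append(label)
    for value in paths:
        paths[value].append("*")  # top of the hierarchy: full suppression
    return paths


if __name__ == "__main__":
    for value, path in build_vgh(EMBEDDINGS, [3, 2]).items():
        print(value, "->", " -> ".join(path))
```

With the toy vectors above, the lower school grades end up in one branch and the university degrees in another, which is the kind of semantically grouped hierarchy a generalization-based anonymizer can then consume in place of a manually written one.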

Paper Structure

This paper contains 13 sections, 5 figures, 2 algorithms.

Figures (5)

  • Figure 1: Mistral AI embeddings visualized with Uniform Manifold Approximation and Projection (UMAP) (left) and the corresponding VGH (right), obtained by Agglomerative Hierarchical Clustering of the embeddings of the values of the Adult dataset's education attribute.
  • Figure 2: Visualization of the embedding similarity and distribution-based motivation for ClustEm4Ano.
  • Figure 3: ClustEm4Ano: Dataflow diagram. The human pictogram denotes variables and processes that must be provided by a user.
  • Figure 4: Accuracy comparison between models trained on data anonymized with VGHs generated by Agglomerative Hierarchical Clustering (left) and KMeans clustering (right).
  • Figure 5: Performance measures. The VGHs used for anonymization were generated using KMeans clustering.