MaSS: Multi-attribute Selective Suppression for Utility-preserving Data Transformation from an Information-theoretic Perspective

Yizhuo Chen, Chun-Fu Chen, Hsiang Hsu, Shaohan Hu, Marco Pistoia, Tarek Abdelzaher

TL;DR

MaSS tackles information-theoretic privacy for multi-attribute data by formulating a constrained mutual-information optimization that maximizes $I(X';F)$ while limiting leakage of sensitive attributes and preserving annotated useful attributes. It introduces a differentiable data transformation framework that uses adversarial surrogates for annotated attributes and InfoNCE-based contrastive learning for unannotated features, yielding a unified, trainable objective. The approach is supported by formal operational bounds and validated on audio, motion, and facial image datasets, where it suppresses sensitive information strongly while maintaining utility. The work provides open-source code and a principled pathway for utility-preserving privacy across diverse data modalities.
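
Written out, the constrained optimization described above might take the following form. This is a sketch rather than the paper's exact statement: the thresholds $m_i$ and $n_j$ and the index ranges borrow the notation of Theorem 3.1, $S_i$ and $U_j$ denote the sensitive and annotated useful attributes, and optimizing over the conditional distribution $P(X'|X)$ is an assumption made here for illustration.

$$
\max_{P(X'|X)} \; I(X';F)
\quad \text{s.t.} \quad
I(X';S_i) \le m_i,\; i \in 1\dots M,
\qquad
I(X';U_j) \ge n_j,\; j \in 1\dots N.
$$

Under this reading, each $m_i$ caps how much the transformed data may reveal about a sensitive attribute, and each $n_j$ sets a floor on how much it must retain about an annotated useful attribute.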

Abstract

The growing richness of large-scale datasets has been crucial in driving the rapid advancement and wide adoption of machine learning technologies. The massive collection and usage of data, however, pose an increasing risk for people's private and sensitive information due to either inadvertent mishandling or malicious exploitation. Besides legislative solutions, many technical approaches have been proposed towards data privacy protection. However, they bear various limitations such as leading to degraded data availability and utility, or relying on heuristics and lacking solid theoretical bases. To overcome these limitations, we propose a formal information-theoretic definition for this utility-preserving privacy protection problem, and design a data-driven learnable data transformation framework that is capable of selectively suppressing sensitive attributes from target datasets while preserving the other useful attributes, regardless of whether or not they are known in advance or explicitly annotated for preservation. We provide rigorous theoretical analyses on the operational bounds for our framework, and carry out comprehensive experimental evaluations using datasets of a variety of modalities, including facial images, voice audio clips, and human activity motion sensor signals. Results demonstrate the effectiveness and generalizability of our method under various configurations on a multitude of tasks. Our code is available at https://github.com/jpmorganchase/MaSS.

Paper Structure

This paper contains 33 sections, 1 theorem, 29 equations, 4 figures, 18 tables.

Key Result

Theorem 3.1

For the Markov chain shown in Figure 2, a solution to the optimization problem defined in Equation eq:problem exists only if every pair $(m_i,n_j)$, $i \in 1\dots M$, $j \in 1\dots N$, satisfies the condition in Equation eq:mn. Under the assumption that $P(S_i|X)$ and $P(U_j|X)$ are degenerate distributions, Equation eq:mn simplifies to a form stated in terms of $H(\cdot)$, the Shannon entropy. Besides, for any $m_i$, $i$ …
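
A toy example may help make the quantities in the theorem concrete. The sketch below is not taken from the paper: the four-valued variable `X`, the attribute functions `sensitive` and `useful`, and the transformation `transform` are all hypothetical. Both attributes are deterministic functions of `X`, which matches the degenerate-distribution assumption, and the code computes mutual information exactly from the finite sample space.

```python
import math
from collections import Counter


def mutual_information(pairs):
    """I(A;B) in bits, treating each (a, b) pair in the list as equally likely."""
    n = len(pairs)
    joint = Counter(pairs)
    p_a = Counter(a for a, _ in pairs)
    p_b = Counter(b for _, b in pairs)
    return sum(
        (count / n) * math.log2((count / n) / ((p_a[a] / n) * (p_b[b] / n)))
        for (a, b), count in joint.items()
    )


# Hypothetical toy data: X is uniform on {0, 1, 2, 3}.
xs = [0, 1, 2, 3]


def sensitive(x):
    """Sensitive attribute S: the high bit of X (a deterministic function of X)."""
    return x // 2


def useful(x):
    """Annotated useful attribute U: the low bit of X."""
    return x % 2


def transform(x):
    """Hypothetical transformation X' = f(X) that keeps only the low bit."""
    return x % 2


leak_before = mutual_information([(x, sensitive(x)) for x in xs])             # 1.0 bit
leak_after = mutual_information([(transform(x), sensitive(x)) for x in xs])   # 0.0 bits
kept_utility = mutual_information([(transform(x), useful(x)) for x in xs])    # 1.0 bit

print(leak_before, leak_after, kept_utility)
```

Here the sensitive and useful attributes are independent, so full suppression and full preservation are compatible. If the useful attribute were instead a copy of the sensitive one, any transformation with zero leakage would also retain zero utility, which illustrates why the thresholds $m_i$ and $n_j$ cannot be chosen independently of each other.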

Figures (4)

  • Figure 1: An illustrative use case of MaSS: The original data sample is a voice clip of a person speaking a digit, where its attributes "gender" and "accent" are considered as sensitive, while its "age" and "spoken digit" are annotated as useful. We are also interested in preserving generic features of the data. For example, the voice clip may contain attributes such as "speaker ID" or "recording room" that could prove to be useful down the road, but are not necessarily explicitly annotated yet at the time of processing. After the transformation of MaSS, sensitive attributes can no longer be accurately inferred, but the other useful attributes are preserved in the transformed data.
  • Figure 2: The Markov chain of all variables. $F$ is correlated with $U,S,X$. $X'$ is only dependent on $X$.
  • Figure 3: The overall architecture of MaSS. The data transformation module converts the original data into a transformed version. The transformed data is then sent to both the sensitive attributes suppression module and the annotated useful attributes preservation module, which compute a relaxed suppression or preservation loss for each attribute, respectively. Additionally, the original and transformed data are sent to the unannotated useful attributes preservation module to compute a contrastive loss. Finally, these losses are aggregated and minimized jointly with respect to $\theta$ and $\eta$, while $\phi$ and $\psi$ are optimized with traditional supervised learning.
  • Figure 4: Visualization of the original and transformed data from the Adience dataset. The first row shows the original facial images; the second and third rows show the transformed images with gender and age suppressed, respectively. Other attributes are preserved as unannotated.
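
The contrastive loss mentioned in the Figure 3 caption is described elsewhere on this page as InfoNCE-based. The following is a minimal, dependency-free sketch of a generic InfoNCE loss, not the authors' implementation: the function names, the cosine-similarity scoring, the `temperature` value, and the example embedding vectors are all assumptions made for illustration.

```python
import math


def info_nce(anchors, positives, temperature=0.1):
    """Generic InfoNCE loss: anchors[i] should match positives[i] and
    be distinguished from every other positives[j] in the batch."""

    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    def normalize(v):
        norm = math.sqrt(dot(v, v)) or 1.0  # guard against zero vectors
        return [x / norm for x in v]

    a = [normalize(v) for v in anchors]
    p = [normalize(v) for v in positives]
    total = 0.0
    for i, a_i in enumerate(a):
        # Cosine similarity of anchor i to every candidate, scaled by temperature.
        logits = [dot(a_i, p_j) / temperature for p_j in p]
        top = max(logits)  # subtract the max for a stable log-sum-exp
        log_partition = top + math.log(sum(math.exp(l - top) for l in logits))
        total += log_partition - logits[i]  # cross-entropy with target index i
    return total / len(a)


def mi_lower_bound(loss, batch_size):
    """log(N) - loss, the InfoNCE lower bound on mutual information (in nats)."""
    return math.log(batch_size) - loss


# Hypothetical embeddings of original samples and of two candidate transformations.
original = [[1.0, 0.0], [0.0, 1.0]]
faithful = [[0.9, 0.1], [0.1, 0.9]]    # each sample stays close to its original
scrambled = [[0.1, 0.9], [0.9, 0.1]]   # each sample resembles the other original

loss_faithful = info_nce(original, faithful)
loss_scrambled = info_nce(original, scrambled)
print(loss_faithful < loss_scrambled)
```

A lower loss means each transformed sample can still be identified with its own original among the batch, which is the sense in which a contrastive term can encourage the transformation to keep features that were never annotated. The bound returned by `mi_lower_bound` can never exceed `log(batch_size)`, so larger batches allow a tighter estimate.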

Theorems & Definitions (2)

  • Theorem 3.1
  • proof