Improved Membership Inference Attacks Against Language Classification Models

Shlomit Shachor, Natalia Razinkov, Abigail Goldsteen

TL;DR

The paper addresses privacy risks in machine learning (ML) through membership inference (MI) attacks. It introduces an ensemble framework that partitions the data into non-overlapping subsets and trains many small, specialized MI attack models on them. The attack models are generated by a grid search over attack architectures, input features, and scaling methods, and their results are averaged across subset pairs and multiple instances to yield a robust leakage estimate. The approach consistently outperforms single-attack and per-class baselines on classical and language classification tasks, including models protected by differential privacy, with improvements of up to 14 percent in leakage metrics. The paper also notes that generation prompts can significantly influence MI success in large language models, and it points to future work extending the framework to other privacy threats and non-classification targets.

Abstract

Artificial intelligence systems are prevalent in everyday life, with use cases in retail, manufacturing, health, and many other fields. With the rise in AI adoption, associated risks have been identified, including privacy risks to the people whose data was used to train models. Assessing the privacy risks of machine learning models is crucial to enabling knowledgeable decisions on whether to use, deploy, or share a model. A common approach to privacy risk assessment is to run one or more known attacks against the model and measure their success rate. We present a novel framework for running membership inference attacks against classification models. Our framework takes advantage of the ensemble method, generating many specialized attack models for different subsets of the data. We show that this approach achieves higher accuracy than either a single attack model or an attack model per class label, both on classical and language classification tasks.
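
The partition-and-average idea summarized above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names (`partition`, `fit_threshold`, `attack_accuracy`, `ensemble_leakage`) are hypothetical, each sample is assumed to be reduced to a single score such as the target model's confidence, and a simple threshold rule stands in for the grid-searched attack models described in the paper. The sketch only shows how non-overlapping member and non-member subsets can be paired, attacked separately, and averaged into one leakage estimate.

```python
import random
from itertools import product
from statistics import mean


def partition(items, subset_size):
    """Split items into non-overlapping subsets of equal size (remainder dropped)."""
    last_start = len(items) - subset_size
    return [items[i:i + subset_size] for i in range(0, last_start + 1, subset_size)]


def fit_threshold(train):
    """Stand-in attack model (assumption): the score threshold that best separates
    members (label 1) from non-members (label 0) on the training pairs."""
    candidates = sorted({score for score, _ in train})
    return max(candidates, key=lambda t: sum((s >= t) == bool(y) for s, y in train))


def attack_accuracy(threshold, data):
    """Fraction of (score, label) pairs that the threshold rule classifies correctly."""
    return mean(float((s >= threshold) == bool(y)) for s, y in data)


def ensemble_leakage(member_scores, nonmember_scores, subset_size=50, seed=0):
    """Average held-out attack accuracy over all (member subset, non-member subset) pairs."""
    if subset_size < 1:
        raise ValueError("subset_size must be at least 1")
    rng = random.Random(seed)
    members, nonmembers = list(member_scores), list(nonmember_scores)
    rng.shuffle(members)
    rng.shuffle(nonmembers)
    pairs = list(product(partition(members, subset_size),
                         partition(nonmembers, subset_size)))
    if not pairs:
        raise ValueError("need at least subset_size members and non-members")
    accuracies = []
    for member_subset, nonmember_subset in pairs:
        # One small attack per pair: label the samples, then split into train and test halves.
        data = [(s, 1) for s in member_subset] + [(s, 0) for s in nonmember_subset]
        rng.shuffle(data)
        half = len(data) // 2
        threshold = fit_threshold(data[:half])
        accuracies.append(attack_accuracy(threshold, data[half:]))
    return mean(accuracies)
```

Calling `ensemble_leakage(member_scores, nonmember_scores)` with two lists of per-sample scores returns a value between 0 and 1. A value near 0.5 suggests the scores carry little membership signal, while values closer to 1 suggest stronger leakage. Replacing `fit_threshold` with a search over several candidate attack models would move the sketch closer to the grid search the paper describes.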

Paper Structure (9 sections, 2 figures, 2 tables)

This paper contains 9 sections, 2 figures, 2 tables.

Figures (2)

  • Figure 1: High-level overview of the framework for small specialized MIA models.
  • Figure 2: Comparative results between a single attack and multiple small attacks across models and datasets. Blue lines represent the single attack; green lines represent the many small attacks. Each pair of adjacent lines represents the same experiment, shown for both classes together (01) and per class (0 or 1, respectively).