BLIA: Detect model memorization in binary classification model through passive Label Inference attack

Mohammad Wahiduzzaman Khan, Sheng Chen, Ilya Mironov, Leizhen Zhang, Rabib Noor

TL;DR

This work addresses the risk of label memorization in binary classifiers by introducing the Binary Label Inference Attack (BLIA), a passive framework that infers training labels solely from the outputs of pre-trained models. The authors analyze two settings (training without Label-DP and training with randomized-response Label-DP) and measure memorization with canaries, controlled subsets in which 50% of the labels are flipped; attack success exceeds random guessing across datasets. BLIA comprises two passive attacks, a threshold-based method and a delta-margin method, with the latter consistently outperforming the former. Experimental results across six benchmarks show that label memorization persists even under strong privacy noise, highlighting the limitations of Label-DP via randomized response and motivating more robust privacy-preserving approaches. The work also provides a reproducible benchmarking framework for assessing privacy-utility trade-offs in binary classification models.

Abstract

Model memorization has implications for both the generalization capacity of machine learning models and the privacy of their training data. This paper investigates label memorization in binary classification models through two novel passive label inference attacks (BLIA). These attacks operate passively, relying solely on the outputs of pre-trained models, such as confidence scores and log-loss values, without interacting with or modifying the training process. By intentionally flipping 50% of the labels in controlled subsets, termed "canaries," we evaluate the extent of label memorization under two conditions: models trained without label differential privacy (Label-DP) and those trained with randomized response-based Label-DP. Despite the application of varying degrees of Label-DP, the proposed attacks consistently achieve success rates exceeding 50%, surpassing the baseline of random guessing and conclusively demonstrating that models memorize training labels, even when these labels are deliberately uncorrelated with the features.
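
The canary construction and the randomized-response mechanism described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function names, the choice to flip exactly half of the canary subset, and the toy data are assumptions. The keep probability $e^{\epsilon}/(1+e^{\epsilon})$ is the standard form of binary randomized response.

```python
import numpy as np

def make_canaries(labels, canary_idx, rng):
    """Flip exactly half of the canary labels, chosen uniformly at random.

    Assumption: "flipping 50%" is read as flipping half of the subset.
    """
    labels = np.asarray(labels).copy()
    canary_idx = np.asarray(canary_idx)
    flip_idx = rng.choice(canary_idx, size=len(canary_idx) // 2, replace=False)
    labels[flip_idx] = 1 - labels[flip_idx]
    return labels, flip_idx

def randomized_response(labels, epsilon, rng):
    """Binary randomized response: keep each label with probability
    e^eps / (1 + e^eps), otherwise flip it."""
    labels = np.asarray(labels)
    keep_prob = 1.0 / (1.0 + np.exp(-epsilon))  # same value, numerically stable
    keep = rng.random(labels.shape[0]) < keep_prob
    return np.where(keep, labels, 1 - labels)

# Toy data (hypothetical): 1000 binary labels, first 200 used as canaries.
rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=1000)
canary_idx = np.arange(200)
y_canary, flipped = make_canaries(y, canary_idx, rng)

# Label-DP setting: perturb the training labels before training.
y_noisy = randomized_response(y_canary, epsilon=1.0, rng=rng)
```

Smaller values of `epsilon` push the keep probability toward 0.5, so the noisy labels carry less information about the originals; large values leave almost every label unchanged.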

Paper Structure

This paper contains 25 sections, 1 theorem, 7 equations, 2 figures, 8 tables, 2 algorithms.

Key Result

Lemma 3.1

Let $D'$ be a dataset where 50% of the labels are randomly flipped, making $y_i'$ independent of $x_i$. If a label inference attack achieves success rate $SR > 0.5$, the model $f$ memorizes labels in $D'$.
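
To make the success-rate criterion concrete, the sketch below computes $SR$ for a simple confidence-threshold guess on a handful of canaries. The exact decision rules of the paper's threshold and delta-margin attacks are not given in this summary, so the 0.5 cutoff, the function names, and the toy confidences are assumptions. The binomial tail is an added sanity check, not part of the lemma: it estimates how likely a given $SR$ would be under pure guessing.

```python
import math
import numpy as np

def threshold_attack(p_class1, threshold=0.5):
    """Guess label 1 where the model's class-1 confidence reaches the threshold.

    Assumption: a fixed 0.5 cutoff stands in for the paper's threshold rule.
    """
    return (np.asarray(p_class1) >= threshold).astype(int)

def success_rate(guesses, trained_labels):
    """Fraction of canaries whose trained-on label is guessed correctly."""
    return float(np.mean(np.asarray(guesses) == np.asarray(trained_labels)))

def binomial_tail(successes, n, p=0.5):
    """P[X >= successes] for X ~ Binomial(n, p): the chance of doing at
    least this well by random guessing."""
    return sum(math.comb(n, k) * p**k * (1 - p) ** (n - k)
               for k in range(successes, n + 1))

# Toy example (made-up confidences for 8 canaries).
p = np.array([0.9, 0.2, 0.7, 0.4, 0.8, 0.1, 0.6, 0.3])
y_trained = np.array([1, 0, 1, 0, 1, 0, 0, 1])  # labels the model was trained on

guesses = threshold_attack(p)
sr = success_rate(guesses, y_trained)        # 6 of 8 correct -> 0.75
p_value = binomial_tail(int(sr * len(p)), len(p))  # 37/256, about 0.145
```

With only 8 canaries, $SR = 0.75$ is still fairly likely under guessing, which is why the number of canaries matters when reading $SR > 0.5$ as evidence of memorization.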

Figures (2)

  • Figure 1: Binary label inference attack (BLIA) to detect model memorization
  • Figure 2: Metrics vs. $\epsilon$ for all benchmarks. Panels (a)-(f) show train accuracy, threshold-based attack success, and delta-margin attack success

Theorems & Definitions (6)

  • Definition 2.1
  • Definition 2.2
  • Definition 2.3
  • Definition 2.4
  • Lemma 3.1
  • proof