Membership Inference on Word Embedding and Beyond

Saeed Mahloujifar, Huseyin A. Inan, Melissa Chase, Esha Ghosh, Marcello Hasegawa

TL;DR

This work shows that word embeddings are vulnerable to black-box membership inference (MI) attacks, and that the leakage persists in downstream NLP tasks such as classification and text generation even when the embedding layer is hidden. It introduces a two-phase MI attack on Word2Vec that leverages semantic relationships and restricted word-pair signals, achieving high accuracy (~90%) and transferring to classifiers and LSTM-based language models without requiring text-generative shadow models. The study also analyzes attack variations, robustness to distribution shifts, and defense considerations, highlighting the practical risk of privacy leakage in NLP pipelines. Overall, the paper exposes a new attack surface for NLP privacy and calls for exploring defenses such as differential privacy to mitigate leakage across embedding-based systems.

Abstract

In the text processing context, most ML models are built on word embeddings. These embeddings are themselves trained on some datasets, potentially containing sensitive data. In some cases this training is done independently; in other cases, it occurs as part of training a larger, task-specific model. In either case, it is of interest to consider membership inference attacks based on the embedding layer as a way of understanding sensitive information leakage. But, somewhat surprisingly, membership inference attacks on word embeddings and their effect on other natural language processing (NLP) tasks that use these embeddings have remained relatively unexplored. In this work, we show that word embeddings are vulnerable to black-box membership inference attacks under realistic assumptions. Furthermore, we show that this leakage persists through two other major NLP applications, classification and text generation, even when the embedding layer is not exposed to the attacker. We show that our MI attack achieves high attack accuracy against a classifier model and an LSTM-based language model. Indeed, our attack is a cheaper membership inference attack on text-generative models, which does not require knowledge of the target model or any expensive training of text-generative models as shadow models.

Paper Structure

This paper contains 42 sections, 5 figures, 5 tables, 5 algorithms.

Figures (5)

  • Figure 1: Security Experiment for attack against distribution of users
  • Figure 2: Security Experiment for attack against a user dataset
  • Figure 3: As the number of shadow models increases, the attack becomes more successful at membership inference. In all settings, the attack seems to reach a plateau after 200 shadow models. Interestingly, the attack can succeed even when there is a single email in the target email set.
  • Figure 4: Success of the attack when the shadow models are trained on a proxy distribution (Avocado) instead of the original (Enron). Interestingly, knowledge of the exact distribution is not crucial to the attack if enough shadow models are trained.
  • Figure 5: Success of the attack when a fraction of the target dataset is used in training.