Combing for Credentials: Active Pattern Extraction from Smart Reply

Bargav Jayaraman, Esha Ghosh, Melissa Chase, Sambuddha Roy, Wei Dai, David Evans

TL;DR

This paper analyzes privacy risks in industry-scale Smart Reply systems by introducing a pattern-extraction attack that relies on active poisoning of fine-tuning data and on canonical data patterns. It evaluates two practical threat models (Service API and Model API) and demonstrates that targeted leakage of emails, passwords, and credentials can occur with only a handful of queries, even when data scrubbing is used. While early stopping modestly reduces leakage, differential privacy provides a strong defense against the attack, at some cost in model perplexity. The work highlights actionable privacy implications for deploying transformer-based Smart Reply solutions and informs defense choices for real-world systems.

Abstract

Pre-trained large language models, such as GPT-2 and BERT, are often fine-tuned to achieve state-of-the-art performance on a downstream task. One natural example is the "Smart Reply" application where a pre-trained model is tuned to provide suggested responses for a given query message. Since the tuning data is often sensitive data such as emails or chat transcripts, it is important to understand and mitigate the risk that the model leaks its tuning data. We investigate potential information leakage vulnerabilities in a typical Smart Reply pipeline. We consider a realistic setting where the adversary can only interact with the underlying model through a front-end interface that constrains what types of queries can be sent to the model. Previous attacks do not work in these settings, but require the ability to send unconstrained queries directly to the model. Even when there are no constraints on the queries, previous attacks typically require thousands, or even millions, of queries to extract useful information, while our attacks can extract sensitive data in just a handful of queries. We introduce a new type of active extraction attack that exploits canonical patterns in text containing sensitive data. We show experimentally that it is possible for an adversary to extract sensitive user information present in the training data, even in realistic settings where all interactions with the model must go through a front-end that limits the types of queries. We explore potential mitigation strategies and demonstrate empirically how differential privacy appears to be a reasonably effective defense mechanism to such pattern extraction attacks.

Paper Structure (22 sections, 2 equations, 13 figures, 7 tables, 1 algorithm)

Figures (13)

  • Figure 1: Smart Reply Scenario
  • Figure 2: Comparing the training and validation set perplexities of GPT-2 and Bert2Bert Smart Reply models trained on 100,000 message-response pairs from the Reddit data set.
  • Figure 3: Comparing the effect of language model output decoding strategies on the Service API attack success in extracting passwords. Each password SSD is inserted 10 times in the training set. The figure shows (mean $\pm$ std) for randomized sampling since it is a non-deterministic decoding technique. All the other decoding methods are deterministic.
  • Figure 4: SSD (mean $\pm$ std) extracted by Service API attack with varying number of queries to the model. Each SSD is inserted 10 times in the training set. Model output decoding is done via randomized sampling.
  • Figure 5: SSD (mean $\pm$ std) extracted by Service API attack (with 20 queries) with varying SSD insertion frequency. Output decoding is done via randomized sampling. The attack fails to extract any SSD when each SSD is inserted only once in the training set.
  • ...and 8 more figures