Effective Code Membership Inference for Code Completion Models via Adversarial Prompts

Yuan Jiang, Zehao Li, Shan Huang, Christoph Treude, Xiaohong Su, Tiantian Wang

TL;DR

This paper addresses privacy risks in code completion LLMs by proposing AdvPrompt-MIA, a gray-box membership inference framework that uses semantics-preserving adversarial prompts to elicit memorization signals. It constructs 27-dimensional feature vectors from model outputs under 11 perturbations and trains a binary classifier to distinguish member from non-member samples, achieving AUC gains of up to 102% over state-of-the-art baselines in evaluations on APPS and HumanEval with models such as Code Llama 7B. The approach generalizes across multiple code LLMs and transfers well between models and datasets, even with limited or no knowledge of the victim's training data. The results indicate that differences in behavioral consistency under perturbation are robust indicators of memorization, making the method a practical tool for auditing and improving privacy in code language models.

Abstract

Membership inference attacks (MIAs) on code completion models offer an effective way to assess privacy risks by inferring whether a given code snippet was part of the training data. Existing black- and gray-box MIAs rely on expensive surrogate models or manually crafted heuristic rules, which limit their ability to capture the nuanced memorization patterns exhibited by over-parameterized code language models. To address these challenges, we propose AdvPrompt-MIA, a method specifically designed for code completion models, combining code-specific adversarial perturbations with deep learning. The core novelty of our method lies in designing a series of adversarial prompts that induce variations in the victim code model's output. By comparing these outputs with the ground-truth completion, we construct feature vectors to train a classifier that automatically distinguishes member from non-member samples. This design allows our method to capture richer memorization patterns and accurately infer training set membership. We conduct comprehensive evaluations on widely adopted models, such as Code Llama 7B, over the APPS and HumanEval benchmarks. The results show that our approach consistently outperforms state-of-the-art baselines, with AUC gains of up to 102%. In addition, our method exhibits strong transferability across different models and datasets, underscoring its practical utility and generalizability.
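
The feature-construction step described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the two perturbations, the `difflib`-based similarity metric, the four features, and the stub model `toy_complete` are all hypothetical stand-ins. The paper itself uses 11 perturbations and 27-dimensional feature vectors whose exact definitions are not given in this summary.

```python
from difflib import SequenceMatcher
from statistics import mean, pstdev


def similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1] (assumed metric, not the paper's)."""
    return SequenceMatcher(None, a, b).ratio()


# Two hypothetical semantics-preserving perturbations of a code prompt.
def rename_variable(prompt: str, old: str = "total", new: str = "acc") -> str:
    # Crude textual rename; a real tool would rename via the syntax tree.
    return prompt.replace(old, new)


def insert_dead_code(prompt: str) -> str:
    # Append a branch that can never execute.
    return prompt + "    if False:\n        pass\n"


PERTURBATIONS = [rename_variable, insert_dead_code]


def build_features(complete, prompt: str, truth: str) -> list[float]:
    """Compare completions for the original and perturbed prompts with the
    ground-truth completion and summarize them as a small feature vector."""
    base = similarity(complete(prompt), truth)
    perturbed = [similarity(complete(p(prompt)), truth) for p in PERTURBATIONS]
    drops = [base - s for s in perturbed]
    return [base, mean(perturbed), pstdev(perturbed), max(drops)]


# Stub standing in for the victim model: it "memorizes" one function,
# keyed by its signature line, and answers everything else generically.
MEMORY = {"def add(a, b):": "    return total\n"}


def toy_complete(prompt: str) -> str:
    return MEMORY.get(prompt.splitlines()[0], "    pass\n")


member_features = build_features(
    toy_complete, "def add(a, b):\n    total = a + b\n", "    return total\n"
)
non_member_features = build_features(
    toy_complete, "def mul(a, b):\n    total = a * b\n", "    return total\n"
)
```

Vectors like `member_features` and `non_member_features`, collected for samples with known membership labels, would then serve as training data for the binary classifier that the abstract describes.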

Paper Structure

This paper contains 39 sections, 14 equations, 12 figures, 3 tables, 1 algorithm.

Figures (12)

  • Figure 1: An illustrative example showing the code pair $(x, y)$, along with the model’s original and perturbed outputs when $(x, y)$ is a memorized sample (middle column) and when it is a non-memorized sample (right column).
  • Figure 2: Workflow of the proposed MIA framework, AdvPrompt-MIA
  • Figure 3: Four strategies for constructing IDC branches
  • Figure 4: Two forms of the IRV transformation
  • Figure 5: Two forms of the IDP transformation
  • ...and 7 more figures