Code Membership Inference for Detecting Unauthorized Data Use in Code Pre-trained Language Models

Sheng Zhang; Hui Li

TL;DR

This paper introduces Code Membership Inference (CMI), the task of detecting unauthorized use of code as training data in code pre-trained language models (CPLMs), together with a practical tool, Buzzer. Buzzer combines signal extraction from pre-training tasks, calibration to handle hard-to-learn samples, and weighted inference to distinguish member from non-member data under both white-box and black-box settings. Experiments on CodeBERT, CodeT5, DeepseekCoder, and CodeLlama report high AUC: larger CPLMs yield stronger membership signals, and black-box performance remains competitive. Buzzer thus offers a concrete auditing framework for CPLMs in support of intellectual-property protection, and the work points to future extensions to larger modalities and stronger generalization capabilities.

Abstract

Code pre-trained language models (CPLMs) have received great attention because they benefit various tasks that facilitate software development and maintenance. However, CPLMs are trained on massive amounts of open-source code, raising concerns about potential data infringement. This paper launches the study of detecting unauthorized code use in CPLMs, i.e., the Code Membership Inference (CMI) task. We design a framework, Buzzer, for different settings of CMI. Buzzer deploys several inference techniques, including signal extraction from pre-training tasks, hard-to-learn sample calibration, and weighted inference, to identify code membership status accurately. Extensive experiments show that CMI can be achieved with high accuracy using Buzzer. Hence, Buzzer can serve as a CMI tool and help protect intellectual property rights.
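To make the idea of hard-to-learn sample calibration concrete, here is a minimal Python sketch. It assumes one common form of calibration from the membership inference literature, namely subtracting the target model's loss from a reference (calibration) model's loss. The abstract does not state that Buzzer uses exactly this form. The function names and the toy loss values are hypothetical and serve only as illustration. They are not results from the paper.

```python
from typing import Sequence


def calibrated_signal(target_loss: float, calib_loss: float) -> float:
    """Assumed calibration: reference-model loss minus target-model loss.

    A larger value means the target model fits the sample unusually well
    compared with a model that never saw it, which hints at membership.
    """
    return calib_loss - target_loss


def auc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Probability that a random member outscores a random non-member.

    Ties count as half a win. Labels: 1 = member, 0 = non-member.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one member and one non-member")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy data (made up): (target_loss, calibration_loss, is_member).
# The second sample is a hard-to-learn member with a high raw loss.
samples = [
    (0.5, 1.5, 1),
    (2.0, 2.4, 1),
    (1.0, 1.1, 0),
    (2.5, 2.5, 0),
]
labels = [y for _, _, y in samples]

# Uncalibrated score: lower target loss means more likely a member.
raw_scores = [-t for t, _, _ in samples]
# Calibrated score: how much better the target does than the reference.
cal_scores = [calibrated_signal(t, c) for t, c, _ in samples]

raw_auc = auc(raw_scores, labels)
cal_auc = auc(cal_scores, labels)
print(f"raw AUC = {raw_auc:.2f}, calibrated AUC = {cal_auc:.2f}")
```

In this toy setting the hard-to-learn member is ranked below an easy non-member when only the raw loss is used, and calibration restores the correct ordering.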

Paper Structure (27 sections, 4 equations, 4 figures, 4 tables)

Introduction
Related Work
Code Pre-trained Language Model
Membership Inference
Code Membership Inference in CPLMs
Task Definition
Knowledge Level
Our Proposed Buzzer
Overview of Two Types of CMI
Signal Extractor
Calibration Model
Weighted Inference Model
Experiments
Settings
Evaluation Metrics
...and 12 more sections

Figures (4)

Figure 1: Overview of our Buzzer framework. First, Buzzer samples three disjoint datasets, $D_t$, $D_s$ and $D_c$, to construct the target, shadow and calibration models, respectively. It then extracts model signals with calibration and trains white-box and black-box classifiers for CMI.
Figure 2: Overview of the signal extractor. It returns signals w.r.t. the pre-training tasks.
Figure 3: Impact of different code features.
Figure 4: Impact of calibration.
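
The data-sampling step described in the Figure 1 caption can be sketched as follows. This is a minimal illustration under stated assumptions: the helper name `split_disjoint`, the roughly equal split sizes, and the fixed seed are hypothetical choices, not details taken from the paper.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def split_disjoint(samples: Sequence[T],
                   seed: int = 0) -> Tuple[List[T], List[T], List[T]]:
    """Shuffle the samples and cut them into three disjoint parts.

    Returns (d_t, d_s, d_c), intended for the target, shadow and
    calibration models. Equal-sized parts are an assumption made here.
    """
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    third = len(shuffled) // 3
    d_t = shuffled[:third]
    d_s = shuffled[third:2 * third]
    d_c = shuffled[2 * third:]  # takes any remainder
    return d_t, d_s, d_c


# Usage with placeholder sample identifiers instead of real code snippets.
d_t, d_s, d_c = split_disjoint(range(10))
print(len(d_t), len(d_s), len(d_c))
```

Keeping the three sets disjoint means a sample used to train the shadow or calibration model cannot also be a member of the target model's training set, which keeps the membership labels unambiguous.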

Theorems & Definitions (1)

Definition 1: Code Membership Inference
