Better Knowledge Enhancement for Privacy-Preserving Cross-Project Defect Prediction

Yuying Wang, Yichen Li, Haozhao Wang, Lei Zhao, Xiaofang Zhang

TL;DR

The paper tackles privacy-preserving cross-project defect prediction under data heterogeneity by introducing FedDP, a federated framework that combines local heterogeneity awareness with global knowledge distillation using open-source data as the distillation dataset. FedDP is shown to outperform privacy-preserving CPDP baselines and approach centralized training performance while reducing communication rounds. The authors validate robustness across datasets and distillation choices, perform ablations, and discuss threats to validity and potential extensions to other software engineering tasks. The work offers a practical pathway to deploy privacy-preserving defect prediction in industry, leveraging open-source knowledge without exposing proprietary data.

Abstract

Cross-Project Defect Prediction (CPDP) poses the non-trivial challenge of constructing a reliable defect predictor by leveraging data from other projects, particularly when data owners are concerned about data privacy. In recent years, Federated Learning (FL) has emerged as a paradigm for protecting private information by collaboratively training a global model among multiple parties without sharing raw data. While directly applying FL to the CPDP task offers a promising way to address privacy concerns, the data heterogeneity arising from proprietary projects across different companies or organizations hinders model training. In this paper, we study privacy-preserving cross-project defect prediction with data heterogeneity under the federated learning framework. To address this problem, we propose a novel knowledge enhancement approach named FedDP with two simple but effective solutions: (1) Local Heterogeneity Awareness and (2) Global Knowledge Distillation. Specifically, we employ open-source project data as the distillation dataset and optimize the global model with the heterogeneity-aware local model ensemble via knowledge distillation. Experimental results on 19 projects from two datasets demonstrate that our method significantly outperforms baselines.

Paper Structure

This paper contains 33 sections, 5 equations, 5 figures, 9 tables, 1 algorithm.

Figures (5)

  • Figure 1: Overview of Federated Learning. Clients first update local models based on the distributed global model. Then, the updated local models are uploaded to the server and aggregated to obtain a global model for the next communication round.
  • Figure 2: Comparison of performance on three different open-source projects in terms of AUC and F1 values.
  • Figure 3: Overview of FedDP between a central server and clients. The server distributes the model and distillation data to each client (step 1). The client then trains its local model and identifies correlation factors between the local data and open-source data (steps 2-3). Clients upload the local model and the correlation factors to the server (step 4). The server aggregates the local models and performs knowledge distillation with correlation factors (steps 5-8).
  • Figure 4: The performance of FL-based CPDP methods under different Client Participation Ratio $R$.
  • Figure 5: The performance of FedDP on different Distillation Steps $N$ and Sampling Sizes $p$.
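
The server-side flow described in the captions of Figures 1 and 3 (aggregate the uploaded local models, then distill the local-model ensemble into the global model on open-source data) can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes linear softmax classifiers, one scalar correlation factor per client, and plain gradient descent; the function names (fedavg, ensemble_teacher, distill) and the parameters steps and lr are hypothetical and chosen only for illustration.

```python
import numpy as np


def softmax(z):
    """Row-wise softmax with max subtraction for numerical stability."""
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)


def fedavg(weights, sizes):
    """Average client weight matrices, weighted by local dataset size
    (the aggregation step sketched in Figure 1)."""
    coef = np.asarray(sizes, dtype=float)
    coef = coef / coef.sum()
    return sum(c * w for c, w in zip(coef, weights))


def ensemble_teacher(weights, corr, X):
    """Build soft labels on the distillation data X by averaging client
    predictions. ASSUMPTION: each client contributes one scalar
    correlation factor, normalised here to sum to 1."""
    corr = np.asarray(corr, dtype=float)
    corr = corr / corr.sum()
    return sum(c * softmax(X @ w) for c, w in zip(corr, weights))


def distill(global_w, teacher, X, steps=50, lr=0.5):
    """Move the global model toward the ensemble's soft labels by gradient
    descent on the cross-entropy between teacher and student outputs.
    steps and lr are illustrative defaults, not values from the paper."""
    w = global_w.copy()
    for _ in range(steps):
        student = softmax(X @ w)
        # Gradient of the mean cross-entropy w.r.t. the weights of a
        # linear softmax model: X^T (student - teacher) / n.
        grad = X.T @ (student - teacher) / len(X)
        w -= lr * grad
    return w
```

Under these assumptions, one server round would call fedavg on the uploaded weights, build the teacher with ensemble_teacher on the open-source data, and pass both to distill. A real model would replace the linear classifier with the defect predictor actually used and compute correlation factors as defined in the paper.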