Learning Defect Prediction from Unrealistic Data

Kamel Alrashedy, Vincent J. Hellendoorn, Alessandro Orso

TL;DR

The paper tackles the mismatch between training on large, unrealistic defect datasets and performance on real-world defects. It introduces a method that identifies the most realistic subset of an unrealistic dataset: it embeds both real-world and synthetic programs in a shared contextual embedding space, keeps the synthetic samples closest to real-world examples via distance-based filtering, pretrains on this subset, and then fine-tunes on real data. Empirical results on vulnerability and bug prediction show that a small, representative portion of the unrealistic data (often 10–25%) yields consistent improvements over baselines and sometimes outperforms training on the full unrealistic dataset, whereas pretraining on the full unrealistic dataset can hurt real-world performance. The findings suggest that careful data curation makes it possible to leverage large synthetic datasets without sacrificing downstream effectiveness, though realism remains constrained by the data generation methods and by the current difficulty of the task.

Abstract

Pretrained models of code, such as CodeBERT and CodeT5, have become popular choices for code understanding and generation tasks. Such models tend to be large and require commensurate volumes of training data, which are rarely available for downstream tasks. Instead, it has become popular to train models with far larger but less realistic datasets, such as functions with artificially injected bugs. Models trained on such data, however, tend to only perform well on similar data, while underperforming on real-world programs. In this paper, we conjecture that this discrepancy stems from the presence of distracting samples that steer the model away from the real-world task distribution. To investigate this conjecture, we propose an approach for identifying the subsets of these large yet unrealistic datasets that are most similar to examples in real-world datasets based on their learned representations. Our approach extracts high-dimensional embeddings of both real-world and artificial programs using a neural model and scores artificial samples based on their distance to the nearest real-world sample. We show that training on only the nearest, representationally most similar samples while discarding samples that are not at all similar in representations yields consistent improvements across two popular pretrained models of code on two code understanding tasks. Our results are promising, in that they show that training models on a representative subset of an unrealistic dataset can help us harness the power of large-scale synthetic data generation while preserving downstream task performance. Finally, we highlight the limitations of applying AI models for predicting vulnerabilities and bugs in real-world applications.
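
The scoring step described in the abstract can be illustrated with a minimal sketch. It assumes the embeddings have already been extracted as NumPy arrays; the function names, the toy two-dimensional vectors, and the `fraction` parameter are hypothetical and not taken from the paper. Euclidean distance is used because the paper's Figure 4 reports Euclidean distances.

```python
import numpy as np

def nearest_real_distance(unreal_emb, real_emb):
    """Euclidean distance from each unrealistic embedding to its nearest real-world embedding."""
    # Broadcasting yields all pairwise differences: shape (n_unreal, n_real, dim).
    diffs = unreal_emb[:, None, :] - real_emb[None, :, :]
    dists = np.linalg.norm(diffs, axis=-1)  # shape (n_unreal, n_real)
    return dists.min(axis=1)                # shape (n_unreal,)

def select_closest(unreal_emb, real_emb, fraction):
    """Indices of the `fraction` of unrealistic samples nearest to any real-world sample."""
    scores = nearest_real_distance(unreal_emb, real_emb)
    k = max(1, int(round(fraction * len(scores))))
    # A stable sort keeps the original order among samples with equal distance.
    return np.argsort(scores, kind="stable")[:k]

# Toy 2-D "embeddings" (illustrative only; real embeddings are high-dimensional).
real = np.array([[0.0, 0.0], [10.0, 10.0]])
unreal = np.array([[0.0, 1.0], [5.0, 5.0], [10.0, 12.0], [3.0, 0.0]])

scores = nearest_real_distance(unreal, real)
kept = select_closest(unreal, real, fraction=0.5)
print(scores, kept)
```

The dense pairwise computation is only practical for small inputs; for datasets of realistic size, a nearest-neighbor index would be the more likely choice, which is again an assumption rather than something the source states.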


Paper Structure

This paper contains 25 sections, 4 figures, 4 tables, and 1 algorithm.

Figures (4)

  • Figure 1: Schematic depiction of the motivation for our work. Unrealistic datasets of bugs, $D_u$, are often far larger than datasets of real-world bugs, $D_r$, and contain many programs that are not representative of such real-world bugs.
  • Figure 2: Overview of our approach, which takes as input an unrealistic dataset, a real-world dataset, and a generically pretrained model, and produces as output a fine-tuned version of the pretrained model. To do so, it (1) converts the programs in the two datasets into a contextual embedding, (2) uses the distance between elements in the embedding to identify a subset of programs in the unrealistic dataset that are most similar to programs in the realistic one, and (3) uses the identified subset plus the realistic dataset to fine-tune the pretrained model.
  • Figure 3: High-dimensional embeddings of both real-world and unrealistic programs, visualized using t-SNE.
  • Figure 4: Euclidean distance (x-axis) and total number (y-axis) of functions of each set of unrealistic data to their nearest real-world data counterparts. We highlight the 10th, 25th, 50th, and 75th percentiles, for each of which we train models in our evaluation. Note that the most distant samples are not visualized, as these are up to four times as distant as those shown here.
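
The percentile cutoffs mentioned in the Figure 4 caption can be sketched as follows. This assumes a precomputed array of nearest-real-world distances, one per unrealistic sample. The helper name `percentile_subsets`, the inclusive `<=` comparison, and the synthetic distances are illustrative assumptions, not details confirmed by the paper.

```python
import numpy as np

# Percentiles highlighted in the Figure 4 caption.
PERCENTILES = (10, 25, 50, 75)

def percentile_subsets(distances, percentiles=PERCENTILES):
    """Map each percentile to the indices of samples at or below that distance cutoff."""
    cutoffs = np.percentile(distances, percentiles)
    return {p: np.flatnonzero(distances <= c) for p, c in zip(percentiles, cutoffs)}

# Synthetic distances 1.0, 2.0, ..., 100.0 stand in for real measurements.
distances = np.arange(1.0, 101.0)
subsets = percentile_subsets(distances)
sizes = {p: len(idx) for p, idx in subsets.items()}
print(sizes)
```

Each subset contains the ones below it, so the 25th-percentile subset includes every sample in the 10th-percentile subset, and a model could be trained on each in turn.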