An Empirical Study on Noisy Label Learning for Program Understanding

Wenhan Wang, Yanzhou Li, Anran Li, Jian Zhang, Wei Ma, Yang Liu

TL;DR

The paper presents an empirical study of noisy label learning (NLL) for program understanding, covering both classification and generation tasks and both synthetic and real-world label noise. It evaluates several NLL approaches (e.g., TracIn, Co-teaching, RobustTrainer, Confident Learning, Simifeat) on small trained-from-scratch models and large pre-trained models across three tasks: program classification, vulnerability detection, and code summarization. Key findings: large pre-trained models are robust to label noise, while small models benefit significantly from NLL; flip noise is more damaging than random noise; and NLL becomes less effective under real-world noise, especially for generation tasks. The work offers actionable insights for dataset quality management in software engineering and suggests two future directions: using NLL for data curation and extending it to SE tasks beyond program understanding.
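To make the distinction between random noise and flip noise concrete, here is a minimal sketch of how synthetic label noise is commonly injected into a classification training set. The definitions are assumptions based on common usage in the NLL literature, not taken from the paper: "random" noise replaces a label with a uniformly chosen different class, and "flip" noise maps each corrupted label to one fixed other class (here, the next class index). The function names and the `seed` parameter are hypothetical.

```python
import random


def _pick_indices(n, rate, rng):
    """Choose which samples to corrupt: a fixed fraction `rate` of all n samples."""
    n_corrupt = int(round(rate * n))
    return sorted(rng.sample(range(n), n_corrupt))


def inject_random_noise(labels, num_classes, rate, seed=0):
    """Assumed 'random' noise: a corrupted label becomes any other class, uniformly."""
    rng = random.Random(seed)
    noisy = list(labels)
    corrupted = _pick_indices(len(noisy), rate, rng)
    for i in corrupted:
        candidates = [c for c in range(num_classes) if c != noisy[i]]
        noisy[i] = rng.choice(candidates)
    return noisy, corrupted


def inject_flip_noise(labels, num_classes, rate, seed=0):
    """Assumed 'flip' noise: a corrupted label c always becomes (c + 1) mod K."""
    rng = random.Random(seed)
    noisy = list(labels)
    corrupted = _pick_indices(len(noisy), rate, rng)
    for i in corrupted:
        noisy[i] = (noisy[i] + 1) % num_classes
    return noisy, corrupted
```

Under these assumed definitions, flip noise concentrates all wrong labels of a class on a single other class, which is one plausible reason it would be harder to recover from than random noise at the same rate.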

Abstract

Recently, deep learning models have been widely applied to program understanding tasks, and these models achieve state-of-the-art results on many benchmark datasets. A major challenge of deep learning for program understanding is that the effectiveness of these approaches depends on the quality of their datasets, and these datasets often contain noisy data samples. A typical kind of noise in program understanding datasets is label noise, which means that the target outputs for some inputs are incorrect. Researchers have proposed various approaches to alleviate the negative impact of noisy labels, forming a new research topic: noisy label learning (NLL). In this paper, we conduct an empirical study on the effectiveness of noisy label learning for deep learning on program understanding datasets. We evaluate various NLL approaches and deep learning models on three tasks: program classification, vulnerability detection, and code summarization. From the evaluation results, we arrive at the following findings: 1) small trained-from-scratch models are prone to label noise in program understanding, while large pre-trained models are highly robust against it. 2) NLL approaches significantly improve program classification accuracy for small models on noisy training sets, but they only slightly improve classification accuracy for large pre-trained models. 3) NLL can effectively detect synthetic noise in program understanding, but struggles to detect real-world noise. We believe our findings can provide insights into the abilities of NLL in program understanding, and shed light on future work in tackling noise in software engineering datasets. We have released our code at https://github.com/jacobwwh/noise_SE.
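As an illustration of what "detecting" noisy labels can mean in practice, the following is a minimal sketch of a generic loss-based detector: samples whose given label receives a high loss under a trained model are flagged as suspected noise, and the flags are scored against the known set of corrupted samples (available only when the noise is synthetic). This is a simplified stand-in, not the paper's implementation of any specific NLL approach; the function names, the `fraction` parameter, and the assumption that the model outputs per-class probabilities are all hypothetical.

```python
import math


def per_sample_loss(probs, labels):
    """Cross-entropy of each sample's given (possibly noisy) label under the model."""
    eps = 1e-12  # guards against log(0)
    return [-math.log(max(p[y], eps)) for p, y in zip(probs, labels)]


def flag_suspected_noise(losses, fraction):
    """Flag the `fraction` of samples with the highest loss as suspected noisy."""
    k = int(round(fraction * len(losses)))
    ranked = sorted(range(len(losses)), key=lambda i: losses[i], reverse=True)
    return set(ranked[:k])


def precision_recall(flagged, true_noisy):
    """Score the flags against the known set of corrupted sample indices."""
    true_positives = len(flagged & true_noisy)
    precision = true_positives / len(flagged) if flagged else 0.0
    recall = true_positives / len(true_noisy) if true_noisy else 0.0
    return precision, recall
```

With real-world noise there is no ground-truth set of corrupted indices, so a score like this has to be replaced by a comparison against human judgments, as in the paper's human evaluation on the TLC subset.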

Paper Structure (15 sections, 6 figures, 8 tables)

Figures (6)

  • Figure 1: An overview of our study.
  • Figure 2: Examples of code summarization data samples for each score. (a): a typical example with score=2. The summary clearly describes the functionality of the code snippet. (b): an example with score=1. The summary describes the code, but additional context (e.g., API documentation) is required to map the code to the summary. Moreover, the summary contains a large amount of non-functional description. (c): an example with score=0. The summary is irrelevant to the code snippet's functionality. Note that this sample cannot be detected by the rule-based approach of Shi et al. (2022).
  • Figure 3: Histogram of the average human evaluation scores on a small subset of TLC.
  • Figure 4: The training loss and validation accuracy during training with 50% random label noise. (a): LSTM. (b): CodeBERT.
  • Figure 5: A comparison between NLL approaches (TracIn and loss) and human evaluation on the TLC code summarization test subset. (a): Losses. (b): TracIn scores.
  • ...and 1 more figure