Learning in the Wild: Towards Leveraging Unlabeled Data for Effectively Tuning Pre-trained Code Models

Shuzheng Gao, Wenxin Mao, Cuiyun Gao, Li Li, Xing Hu, Xin Xia, Michael R. Lyu

TL;DR

HINT addresses the challenge of limited labeled data for downstream code tasks by leveraging large-scale unlabeled data through pseudo-labeling. It introduces two key components: HybrId pseudo-labeled data selection, combining loss-based filtering with a retrieval-based similarity check, and Noise-tolerant Training, which uses a symmetric cross-entropy loss plus consistency regularization to mitigate label noise. The framework, validated on CodeBERT, CodeT5, and UniXcoder across code summarization, defect detection, and assertion generation, yields consistent improvements and strong cross-domain transfer, demonstrating effective task-specific use of unlabeled data. This approach offers a practical path to enhance real-world code intelligence systems by better exploiting unlabeled data during tuning while maintaining robustness to noise.
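As a rough illustration of the noise-tolerant loss mentioned above, the sketch below computes a symmetric cross-entropy for a single classification example: the usual cross-entropy plus a reverse term in which the roles of prediction and label are swapped. The weights `alpha` and `beta`, the clipping constant `log_zero`, and the function names are assumptions made for this sketch and are not taken from the paper, which may weight or implement the loss differently.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def symmetric_cross_entropy(logits, label, alpha=1.0, beta=1.0, log_zero=-4.0):
    """Symmetric cross-entropy for one example with a hard (one-hot) label.

    alpha, beta, and log_zero are illustrative values (assumptions),
    not settings reported by the paper.
    """
    probs = softmax(logits)
    # Forward term: standard cross-entropy, -log p(label).
    ce = -math.log(max(probs[label], 1e-12))
    # Reverse term: -sum_k p_k * log q_k, where q is the one-hot label.
    # log(1) = 0 for the labeled class; log(0) is clipped to log_zero,
    # so only the non-labeled classes contribute.
    rce = -sum(p * log_zero for k, p in enumerate(probs) if k != label)
    return alpha * ce + beta * rce
```

Because the reverse term is bounded by `-log_zero`, a confidently wrong pseudo label cannot dominate the loss the way it can under plain cross-entropy, which is the usual motivation for this kind of loss when labels are noisy.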

Abstract

Pre-trained code models have recently achieved substantial improvements in many code intelligence tasks. These models are first pre-trained on large-scale unlabeled datasets in a task-agnostic manner using self-supervised learning, and then fine-tuned on labeled datasets in downstream tasks. However, the labeled datasets are usually limited in size because labeling requires intensive human effort, which may hinder the performance of pre-trained code models in specific tasks. To mitigate this, one possible solution is to leverage the large-scale unlabeled data in the tuning stage by pseudo-labeling. However, directly employing the pseudo-labeled data can introduce a large amount of noise, i.e., incorrect labels, leading to suboptimal performance. How to effectively leverage the noisy pseudo-labeled data is a challenging yet under-explored problem.

In this paper, we propose a novel approach named HINT to improve pre-trained code models with large-scale unlabeled datasets by better utilizing the pseudo-labeled data. HINT includes two main modules: HybrId pseudo-labeled data selection and Noise-tolerant Training. In the hybrid pseudo-labeled data selection module, because measuring the quality of pseudo labels through training loss alone is not robust, we further propose to employ a retrieval-based method to filter low-quality pseudo-labeled data. The noise-tolerant training module aims to further mitigate the influence of errors in pseudo labels by training the model with a noise-tolerant loss function and by regularizing the consistency of model predictions.

The experimental results show that HINT can better leverage unlabeled data in a task-specific way and provide complementary benefits for pre-trained models, e.g., improving the best baseline model by 15.33%, 16.50%, and 8.98% on code summarization, defect detection, and assertion generation, respectively.
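The hybrid selection idea described in the abstract can be sketched as a two-step filter: first keep the pseudo-labeled samples with the lowest training loss, then retrieve the most similar labeled example for each survivor and apply a similarity check. Everything concrete here is an assumption for illustration only: the dictionary fields, the token-level Jaccard similarity used for both retrieval and comparison, the parameters `keep_ratio` and `sim_threshold`, and in particular the direction of the check (dropping a sample whose pseudo label nearly duplicates the retrieved example's label). The source only states that a retrieval-based method filters low-quality pseudo-labeled data.

```python
def jaccard(a, b):
    """Token-set Jaccard similarity between two whitespace-tokenized strings."""
    sa, sb = set(a.split()), set(b.split())
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


def hybrid_select(pseudo, labeled, keep_ratio=0.5, sim_threshold=0.8):
    """Hypothetical two-step filter over pseudo-labeled samples.

    pseudo:  list of dicts with keys 'code', 'label', 'loss'
    labeled: list of dicts with keys 'code', 'label'
    """
    if not pseudo:
        return []
    # Step 1 (loss-based): keep the lowest-loss fraction of samples.
    ranked = sorted(pseudo, key=lambda s: s["loss"])
    candidates = ranked[:max(1, int(len(ranked) * keep_ratio))]
    if not labeled:
        return candidates
    # Step 2 (retrieval-based, assumed rule): find the labeled example
    # whose code is most similar, and drop the sample if its pseudo label
    # is a near copy of that example's label.
    selected = []
    for sample in candidates:
        nearest = max(labeled, key=lambda ex: jaccard(sample["code"], ex["code"]))
        if jaccard(sample["label"], nearest["label"]) < sim_threshold:
            selected.append(sample)
    return selected


labeled = [{"code": "def add(a, b): return a + b", "label": "add two numbers"}]
pseudo = [
    {"code": "def sub(a, b): return a - b", "label": "add two numbers", "loss": 0.1},
    {"code": "def mul(a, b): return a * b", "label": "multiply two values", "loss": 0.2},
    {"code": "def noop(): pass", "label": "parse json file", "loss": 3.0},
]
kept = hybrid_select(pseudo, labeled, keep_ratio=0.7, sim_threshold=0.8)
```

In this toy data the first sample has the lowest loss but its label is copied from a labeled example with different code, so the second step removes it; the third sample is already removed by the loss filter. This mirrors the motivation stated for Figure 1, that loss alone may misjudge pseudo-label quality.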

Paper Structure (33 sections, 4 equations, 10 figures, 5 tables, 1 algorithm)

Figures (10)

  • Figure 1: Examples in the code summarization task for illustrating the motivation of the hybrid pseudo-labeled data selection method, which indicates the loss-based data selection strategy alone may incorrectly measure the quality of pseudo labels.
  • Figure 2: The overview of HINT.
  • Figure 3: An example of a pseudo label with a minor error.
  • Figure 4: Parameter analysis on threshold $K$.
  • Figure 5: Parameter analysis on $t$ and $\mu$.
  • ...and 5 more figures