Improving the Ability of Pre-trained Language Model by Imparting Large Language Model's Experience

Xin Yin, Chao Ni, Xiaodan Xu, Xinrui Li, Xiaohu Yang

TL;DR

This work tackles data scarcity for non-generative software engineering tasks by using Large Language Models (LLMs) to generate domain-specific data that augments the training of pre-trained Language Models (LMs). The authors systematically compare eight LLMs and eight LMs across fault localization and code clone detection, showing substantial gains when LM training data is enriched with LLM-produced faults and clones. They introduce a semantic-based data selection strategy and task-specific evaluation, demonstrating improvements of up to 58.36% in fault localization and up to 6.09% in clone detection, with encoder-only LMs benefiting markedly from generated data. The study also analyzes the trade-offs between fine-tuning LLMs and leveraging LMs, revealing that while LLMs offer data generation advantages, fine-tuning them incurs higher computational costs and sometimes yields smaller performance gains, especially in clone detection. Overall, the work highlights a practical, data-driven pathway to boost LM effectiveness in software engineering tasks by combining LLM-generated training data with careful data curation.

Abstract

Large Language Models (LLMs) and pre-trained Language Models (LMs) have achieved impressive success on many software engineering tasks (e.g., code completion and code generation). By leveraging huge existing code corpora (e.g., GitHub), these models can understand the patterns in source code and use these patterns to predict code properties. However, LLMs under few-shot learning perform poorly on non-generative tasks (e.g., fault localization and vulnerability localization), and fine-tuning LLMs is time-consuming and costly for end users and small organizations. Furthermore, the performance of fine-tuning LMs for non-generative tasks is impressive, yet it heavily depends on the amount and quality of data. As a result, the current lack of data and the high cost of collecting it in real-world scenarios further limit the applicability of LMs. In this paper, we leverage the powerful generation capabilities of LLMs to enhance pre-trained LMs. Specifically, we use LLMs to generate domain-specific data, thereby improving the performance of pre-trained LMs on the target tasks. We conduct experiments by combining different LLMs in our generation phase and introducing various LMs to learn from the LLM-generated data. Then, we compare the performance of these LMs before and after learning the data. We find that LLM-generated data significantly enhances the performance of LMs. The improvement can reach up to 58.36% for fault localization and up to 6.09% for clone detection.

Paper Structure (30 sections, 4 figures, 14 tables)

Figures (4)

  • Figure 1: Fine-tune LMs to learn from LLM-generated data
  • Figure 2: An example of prompt for fault generation
  • Figure 3: Average decrease (eight LMs) of FPR in fault localization (RQ3)
  • Figure 4: Average result (eight LMs) in fault localization when using ranking or random selection