Large Language Models as Robust Data Generators in Software Analytics: Are We There Yet?

Md. Abdul Awal, Mrigank Rochan, Chanchal K. Roy

TL;DR

This work evaluates whether LLM-generated data can substitute for human-written data when fine-tuning robust pre-trained transformers for software analytics under adversarial attacks. It conducts a cross-task comparison across clone detection, code summarization, and sentiment analysis using six PTMs, nine adversarial attacks, and multiple quality metrics. Findings show that while LLM-generated data can match or slightly surpass human-written data on natural-language tasks, human-written data consistently yields greater robustness to adversarial perturbations, especially for code-related tasks; adversarial-example quality is also higher for human-written data. The study highlights the need to improve the quality of LLM-generated data and informs data-generation strategies for robust software analytics systems, suggesting a continued emphasis on human-authored data for security-critical applications and targeted defense improvements for models trained on LLM-generated data.

Abstract

Large Language Model (LLM)-generated data is increasingly used in software analytics, but it is unclear how this data compares to human-written data, particularly when models are exposed to adversarial scenarios. Adversarial attacks can compromise the reliability and security of software systems. Comparing LLM-generated data under these conditions against human-written data, which serves as the benchmark for model performance, can show whether LLM-generated data offers similar robustness and effectiveness. To address this gap, we systematically evaluate and compare the quality of human-written and LLM-generated data for fine-tuning robust pre-trained models (PTMs) in the context of adversarial attacks. We evaluate the robustness of six widely used PTMs, fine-tuned on human-written and LLM-generated data, before and after adversarial attacks. This evaluation employs nine state-of-the-art (SOTA) adversarial attack techniques across three popular software analytics tasks: clone detection, code summarization, and sentiment analysis in code review discussions. Additionally, we analyze the quality of the generated adversarial examples using eleven similarity metrics. Our findings reveal that while PTMs fine-tuned on LLM-generated data perform competitively with those fine-tuned on human-written data, they are less robust against adversarial attacks in software analytics tasks. Our study underscores the need for further work on improving the quality of LLM-generated training data to develop models that are both high-performing and capable of withstanding adversarial attacks in software analytics.

Paper Structure

This paper contains 28 sections, 2 figures, 5 tables.

Figures (2)

  • Figure 1: A heatmap showing %ASR and AMQ metric values of fine-tuned models under adversarial attacks using human-written and LLM-generated data. Here, _._H and _._L denote the adversarial attacks performed on PTMs fine-tuned on human-written and LLM-generated data, respectively.
  • Figure 2: Entropy change between original and adversarial examples after adversarial attacks on PTMs fine-tuned on human-written and LLM-generated data.
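
Figure 1 reports %ASR without defining it on this page. Assuming ASR stands for attack success rate in its common sense (the share of examples a model classified correctly before the attack that it misclassifies afterwards), the metric can be sketched as below. The function name and the toy inputs are hypothetical illustrations, not taken from the paper, and the authors' exact definition may differ.

```python
from typing import Sequence


def attack_success_rate(
    labels: Sequence[int],
    clean_preds: Sequence[int],
    adv_preds: Sequence[int],
) -> float:
    """Percentage of originally correct predictions that the attack flips.

    Assumption: only examples the model got right before the attack count
    as attackable; the paper may use a different denominator.
    """
    if not (len(labels) == len(clean_preds) == len(adv_preds)):
        raise ValueError("all three sequences must have the same length")

    # Indices the model classified correctly on the unperturbed input.
    attackable = [i for i, (y, p) in enumerate(zip(labels, clean_preds)) if y == p]
    if not attackable:
        return 0.0

    # An attack succeeds when the adversarial prediction no longer matches the label.
    flipped = sum(1 for i in attackable if adv_preds[i] != labels[i])
    return 100.0 * flipped / len(attackable)


# Toy data: three of four examples are correct before the attack, one is flipped.
asr = attack_success_rate(
    labels=[1, 0, 1, 1],
    clean_preds=[1, 0, 0, 1],
    adv_preds=[0, 0, 1, 1],
)
```

Under this definition a lower %ASR means a more robust model, which is the direction in which the human-written and LLM-generated columns of the heatmap would be compared.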