Table of Contents
Fetching ...

Beyond ChatGPT: Enhancing Software Quality Assurance Tasks with Diverse LLMs and Validation Techniques

Ratnadira Widyasari, David Lo, Lizi Liao

TL;DR

This paper conducts a comprehensive investigation into the capabilities of several LLMs across two SQA tasks: fault localization and vulnerability detection and introduces a cross-validation approach to refine the LLM answer by validating one LLM answer against another using a validation prompt.

Abstract

With the advancement of Large Language Models (LLMs), their application in Software Quality Assurance (SQA) has increased. However, the current focus of these applications is predominantly on ChatGPT. There remains a gap in understanding the performance of various LLMs in this critical domain. This paper aims to address this gap by conducting a comprehensive investigation into the capabilities of several LLMs across two SQA tasks: fault localization and vulnerability detection. We conducted comparative studies using GPT-3.5, GPT-4o, and four other publicly available LLMs (LLaMA-3-70B, LLaMA-3-8B, Gemma-7B, and Mixtral-8x7B), to evaluate their effectiveness in these tasks. Our findings reveal that several LLMs can outperform GPT-3.5 in both tasks. Additionally, even the lower-performing LLMs provided unique correct predictions, suggesting the potential of combining different LLMs' results to enhance overall performance. By implementing a voting mechanism to combine the LLMs' results, we achieved more than a 10% improvement over the GPT-3.5 in both tasks. Furthermore, we introduced a cross-validation approach to refine the LLM answer by validating one LLM answer against another using a validation prompt. This approach led to performance improvements of 16% in fault localization and 12% in vulnerability detection compared to the GPT-3.5, with a 4% improvement compared to the best-performed LLMs. Our analysis also indicates that the inclusion of explanations in the LLMs' results affects the effectiveness of the cross-validation technique.

Beyond ChatGPT: Enhancing Software Quality Assurance Tasks with Diverse LLMs and Validation Techniques

TL;DR

This paper conducts a comprehensive investigation into the capabilities of several LLMs across two SQA tasks: fault localization and vulnerability detection and introduces a cross-validation approach to refine the LLM answer by validating one LLM answer against another using a validation prompt.

Abstract

With the advancement of Large Language Models (LLMs), their application in Software Quality Assurance (SQA) has increased. However, the current focus of these applications is predominantly on ChatGPT. There remains a gap in understanding the performance of various LLMs in this critical domain. This paper aims to address this gap by conducting a comprehensive investigation into the capabilities of several LLMs across two SQA tasks: fault localization and vulnerability detection. We conducted comparative studies using GPT-3.5, GPT-4o, and four other publicly available LLMs (LLaMA-3-70B, LLaMA-3-8B, Gemma-7B, and Mixtral-8x7B), to evaluate their effectiveness in these tasks. Our findings reveal that several LLMs can outperform GPT-3.5 in both tasks. Additionally, even the lower-performing LLMs provided unique correct predictions, suggesting the potential of combining different LLMs' results to enhance overall performance. By implementing a voting mechanism to combine the LLMs' results, we achieved more than a 10% improvement over the GPT-3.5 in both tasks. Furthermore, we introduced a cross-validation approach to refine the LLM answer by validating one LLM answer against another using a validation prompt. This approach led to performance improvements of 16% in fault localization and 12% in vulnerability detection compared to the GPT-3.5, with a 4% improvement compared to the best-performed LLMs. Our analysis also indicates that the inclusion of explanations in the LLMs' results affects the effectiveness of the cross-validation technique.
Paper Structure (31 sections, 3 equations, 5 figures, 8 tables)

This paper contains 31 sections, 3 equations, 5 figures, 8 tables.

Figures (5)

  • Figure 1: Illustration of the proposed approach on cross-validating GPT-4o and LLaMA-3-70B
  • Figure 2: Validation prompt
  • Figure 3: Intersection diagram showing successfully localized faults at Top-1
  • Figure 4: Intersection diagram showing correct predictions in vulnerability detection\ref{['footnote:1']}
  • Figure 5: Venn diagram of cross-validation prompt result with the default initial prompt.