
LLMs as Evaluators: A Novel Approach to Evaluate Bug Report Summarization

Abhishek Kumar, Sonia Haiduc, Partha Pratim Das, Partha Pratim Chakrabarti

TL;DR

The paper addresses the cost and fatigue of human evaluation in bug report summarization by testing Large Language Models (GPT-4o, LLaMA-3, Gemini) as automated evaluators for two tasks: bug title selection and bug report summary evaluation. It details a replication-friendly experimental setup with standardized instructions and an accuracy-based metric to compare humans and the three LLMs. Results show that GPT-4o generally yields the highest accuracy, that "None of the Above" (NOTA) items are notably difficult for all evaluators, and that human fatigue becomes evident as task difficulty increases. The work demonstrates the potential for scalable, automated evaluation of software summaries and outlines directions for broader validation and dataset expansion.

Abstract

Summarizing software artifacts is an important task that has been thoroughly researched. For evaluating software summarization approaches, human judgment is still the most trusted evaluation. However, it is time-consuming and fatiguing for evaluators, making it challenging to scale and reproduce. Large Language Models (LLMs) have demonstrated remarkable capabilities in various software engineering tasks, motivating us to explore their potential as automatic evaluators for approaches that aim to summarize software artifacts. In this study, we investigate whether LLMs can evaluate bug report summarization effectively. We conducted an experiment in which we presented the same set of bug summarization problems to humans and three LLMs (GPT-4o, LLaMA-3, and Gemini) for evaluation on two tasks: selecting the correct bug report title and bug report summary from a set of options. Our results show that LLMs performed generally well in evaluating bug report summaries, with GPT-4o outperforming the other LLMs. Additionally, both humans and LLMs showed consistent decision-making, but humans experienced fatigue, impacting their accuracy over time. Our results indicate that LLMs demonstrate potential for being considered as automated evaluators for bug report summarization, which could allow scaling up evaluations while reducing human evaluators' effort and fatigue.
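
The comparison described above rests on scoring each evaluator by how often it picks the correct option. The following is a minimal sketch of such an accuracy computation, not the authors' implementation: the function name `accuracy`, the option labels, and the sample answer key and responses are all hypothetical and chosen only for illustration.

```python
from typing import Dict, List, Sequence


def accuracy(chosen: Sequence[str], correct: Sequence[str]) -> float:
    """Return the fraction of items where the chosen option equals the correct one."""
    if len(chosen) != len(correct):
        raise ValueError("chosen and correct must have the same length")
    if not correct:
        raise ValueError("need at least one item to score")
    hits = sum(c == k for c, k in zip(chosen, correct))
    return hits / len(correct)


# Hypothetical answer key for four multiple-choice items.
# "NOTA" is an assumed label for an item whose correct answer is none of the listed options.
answer_key: List[str] = ["B", "NOTA", "A", "C"]

# Hypothetical responses from two evaluators (illustrative data, not results from the paper).
responses: Dict[str, List[str]] = {
    "human": ["B", "A", "A", "C"],     # misses the NOTA item
    "llm": ["B", "NOTA", "A", "D"],    # misses the last item
}

# Score every evaluator against the same answer key.
scores = {name: accuracy(picks, answer_key) for name, picks in responses.items()}
print(scores)
```

Scoring every evaluator against one shared answer key is what makes the human and LLM numbers directly comparable; the same function could be applied separately to the title-selection and summary-selection items.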

Paper Structure (10 sections, 1 equation, 5 figures)

Figures (5)

  • Figure 1: The overview of our study.
  • Figure 2: Comparison of Human Evaluators' and LLMs' accuracy in evaluating bug titles and bug summaries.
  • Figure 3: Performance of Human Evaluators and LLMs in assessing factual correctness of the summaries.
  • Figure 4: Performance of Human Evaluators and LLMs in assessing completeness of the summary.
  • Figure 5: Performance of Human Evaluators and LLMs in detecting hallucination in the summaries.