Combining Language and App UI Analysis for the Automated Assessment of Bug Reproduction Steps

Junayed Mahmud, Antu Saha, Oscar Chaparro, Kevin Moran, Andrian Marcus

TL;DR

AstroBR addresses unclear and incomplete steps to reproduce (S2Rs) in bug reports by combining GPT-4-based natural language processing with a graph-based GUI execution model built through dynamic analysis, which it uses to identify and extract S2Rs and map them to GUI interactions. The method has four phases: S2R sentence identification, individual S2R extraction, app execution model generation, and S2R quality assessment with missing-step generation. It is driven by prompts designed and evaluated across zero-shot, few-shot, and chain-of-thought strategies. The work establishes a ground-truth dataset and, on 21 Android bug reports, shows that AstroBR achieves a 25.2% higher F1 in S2R quality annotation and a 71.4% higher F1 in missing-S2R detection than the state-of-the-art Euler method. The authors provide replication data and an evaluation framework intended to support extension to additional interaction types and datasets.

Abstract

Bug reports are essential for developers to confirm software problems, investigate their causes, and validate fixes. Unfortunately, reports often miss important information or are written unclearly, which can cause delays, increased issue resolution effort, or even the inability to solve issues. One of the most commonly problematic components of bug reports is the steps to reproduce the bug(s) (S2Rs), which are essential to replicate the described program failures and reason about fixes. Given the proclivity for deficiencies in reported S2Rs, prior work has proposed techniques that assist reporters in writing or assessing the quality of S2Rs. However, automated understanding of S2Rs is challenging and requires linking nuanced natural language phrases with specific, semantically related program information. Prior techniques often struggle to form such language-to-program connections because of language variability and the limited information gleaned from program analyses. To more effectively tackle the problem of S2R quality annotation, we propose a new technique called AstroBR, which leverages the language understanding capabilities of LLMs to identify and extract the S2Rs from bug reports and map them to GUI interactions in a program state model derived via dynamic analysis. We compared AstroBR to a related state-of-the-art approach and found that AstroBR annotates S2Rs 25.2% better (in terms of F1 score) than the baseline. Additionally, AstroBR suggests more accurate missing S2Rs than the baseline (by 71.4% in terms of F1 score).
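
To make the idea of mapping S2Rs onto a program state model more concrete, here is a minimal Python sketch. It is not the authors' implementation: the screen names, interaction strings, exact-string matching, status labels, and the use of a shortest path to suggest missing steps are all assumptions made for illustration, and the abstract does not specify how AstroBR performs these steps (in the paper, an LLM handles the language side).

```python
from collections import deque

# Hypothetical execution model: screen -> list of (interaction, next_screen).
# Screen names and interaction strings are invented for illustration.
GRAPH = {
    "Home": [("tap 'Settings'", "Settings"), ("tap 'New note'", "Editor")],
    "Settings": [("tap 'Theme'", "ThemePicker")],
    "ThemePicker": [("tap 'Dark'", "Settings")],
    "Editor": [],
}


def shortest_path(graph, start, goal):
    """Return the interactions on a shortest path from start to goal, or None."""
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        screen, path = queue.popleft()
        if screen == goal:
            return path
        for interaction, nxt in graph.get(screen, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, path + [interaction]))
    return None


def assess_steps(graph, start, steps):
    """Walk reported steps through the graph and suggest steps that bridge gaps.

    Returns one (step, status, suggested_missing_steps) tuple per reported step.
    The status labels are assumptions, not the paper's annotation vocabulary.
    """
    current, report = start, []
    for step in steps:
        # All (source, target) pairs whose edge carries this interaction.
        candidates = [(src, dst) for src, edges in graph.items()
                      for interaction, dst in edges if interaction == step]
        best = None
        for src, dst in candidates:
            # Interactions needed to get from the current screen to the
            # screen where this step can actually be performed.
            bridge = shortest_path(graph, current, src)
            if bridge is not None and (best is None or len(bridge) < len(best[0])):
                best = (bridge, dst)
        if best is None:
            report.append((step, "no match", []))
            continue
        bridge, current = best
        report.append((step, "missing steps" if bridge else "ok", bridge))
    return report


if __name__ == "__main__":
    for row in assess_steps(GRAPH, "Home", ["tap 'Theme'", "tap 'Dark'"]):
        print(row)
```

In this toy run the report skips opening the settings screen, so the first step is flagged with a suggested bridging interaction, while the second step follows directly from the resulting screen. A real system would need fuzzy or model-based matching between the report's wording and the recorded interactions rather than exact string equality.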

Paper Structure

This paper contains 28 sections, 4 figures, 4 tables.

Figures (4)

  • Figure 1: Bug Report Quality Annotations
  • Figure 2: The AstroBR Approach
  • Figure 3: Structure of the Developed Prompts
  • Figure 4: # of Missing Steps Generated by AstroBR