Automatically Identifying Solution-Related Content in Issue Report Discussions with Language Models

Antu Saha, Mehedi Sun, Oscar Chaparro

TL;DR

This work tackles the problem of automatically identifying solution-related content within issue report discussions by evaluating a spectrum of language-model-based classifiers. It systematically compares embeddings-based machine learning, prompting, and fine-tuning across MLMs, PLMs, and LLMs using a Mozilla Firefox dataset. The key findings are that fine-tuned Llama-3 and RoBERTa deliver the strongest performance (0.716 and 0.692 F1, respectively) and that ensemble strategies raise the score further to 0.737 F1. The study also demonstrates cross-project generalization to Chromium and GnuCash, with hybrid training (Mozilla data plus a small amount of target-project data) providing the best balance of generalization and adaptation. Practical implications include enabling lightweight embedding-based classifiers for resource-constrained settings and providing publicly available replication artifacts for verification and extension. Overall, the results highlight the value of domain-specific fine-tuning and of combining diverse model families to support efficient maintenance and reuse of solution content in issue-tracking workflows.

Abstract

During issue resolution, software developers rely on issue reports to discuss solutions for defects, feature requests, and other changes. These discussions contain proposed solutions (from design changes to code implementations) as well as their evaluations. Locating solution-related content is essential for investigating reopened issues, addressing regressions, reusing solutions, and understanding code change rationale. Manually reading long discussions to identify such content can be difficult and time-consuming. This paper automates solution identification using language models as supervised classifiers. We investigate three applications (embeddings, prompting, and fine-tuning) across three classifier types: traditional ML models (MLMs), pre-trained language models (PLMs), and large language models (LLMs). Using 356 Mozilla Firefox issues, we created a dataset to train and evaluate six MLMs, four PLMs, and two LLMs across 68 configurations. Results show that MLMs with LLM embeddings outperform those with TF-IDF features, prompting underperforms, and fine-tuned LLMs achieve the highest performance, with the fine-tuned Llama model (LLAMAft) reaching an F1 score of 0.716. Ensembles of the best models further improve results (0.737 F1). Misclassifications often arise from misleading clues or missing context, highlighting the need for context-aware classifiers. Models trained on Mozilla transfer to other projects, and adding a small amount of project-specific data further enhances results. This work supports software maintenance, issue understanding, and solution reuse.
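To make the ensemble and F1 figures above more concrete, the following is a minimal sketch of how per-comment predictions from several binary classifiers could be combined and scored. It assumes majority voting as the combination rule, which the abstract does not specify; the function names `majority_vote` and `f1_score`, the label encoding (1 = solution-related, 0 = other), and the toy predictions are all hypothetical and are not taken from the paper or its replication package.

```python
from collections import Counter


def majority_vote(predictions):
    """Combine per-model label lists into one list by majority vote.

    `predictions` is a list of equal-length label lists, one per model.
    Using an odd number of models avoids ties for binary labels.
    """
    combined = []
    for labels in zip(*predictions):
        # most_common(1) returns [(label, count)] for the most frequent label
        combined.append(Counter(labels).most_common(1)[0][0])
    return combined


def f1_score(y_true, y_pred, positive=1):
    """F1 for the positive class: harmonic mean of precision and recall."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0  # no true positives: define F1 as 0 to avoid dividing by zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy data (hypothetical): 1 = solution-related comment, 0 = other comment.
y_true = [1, 0, 1, 1, 0, 0]
model_a = [1, 0, 0, 1, 0, 1]
model_b = [1, 1, 1, 0, 0, 0]
model_c = [0, 0, 1, 1, 0, 0]

ensemble = majority_vote([model_a, model_b, model_c])

for name, preds in [("model_a", model_a), ("model_b", model_b),
                    ("model_c", model_c), ("ensemble", ensemble)]:
    print(f"{name}: F1 = {f1_score(y_true, preds):.3f}")
```

In this toy setup each model errs on different comments, so the vote corrects individual mistakes; real ensembles only help to the extent that the combined models make complementary errors.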

Paper Structure

This paper contains 47 sections, 4 figures, 9 tables.

Figures (4)

  • Figure 1: Firefox issue report #667586, split into two parts, containing 36 comments with different kinds of information. Solution-related comments are shown in the green boxes.
  • Figure 2: Study Methodology
  • Figure 3: Prompt Templates Based on ZS, FS, & CoT Prompting (Issue Description is Optional)
  • Figure 4: Performance of the MLMs across LM and TF-IDF Embeddings (Aggregating Across the Six MLMs)