Revisiting Sentiment Analysis for Software Engineering in the Era of Large Language Models

Ting Zhang; Ivana Clairine Irsan; Ferdian Thung; David Lo

TL;DR

This study investigates whether bigger large language models (bLLMs) can address the labeled-data shortage that hampers fine-tuned smaller large language models (sLLMs) in sentiment analysis for software engineering. It evaluates bLLMs in zero-shot and few-shot settings and compares them with fine-tuned sLLMs, which learn contextual embeddings of text from software platforms.

Abstract

Software development involves collaborative interactions where stakeholders express opinions across various platforms. Recognizing the sentiments conveyed in these interactions is crucial for the effective development and ongoing maintenance of software systems. For software products, analyzing the sentiment of user feedback (e.g., reviews, comments, and forum posts) can provide valuable insights into user satisfaction and areas for improvement, which can guide the development of future updates and features. However, accurately identifying sentiments in software engineering datasets remains challenging. This study investigates whether bigger large language models (bLLMs) can address the labeled-data shortage that hampers fine-tuned smaller large language models (sLLMs) in software engineering tasks. We conduct a comprehensive empirical study using five established datasets to assess three open-source bLLMs in zero-shot and few-shot scenarios. Additionally, we compare them with fine-tuned sLLMs, which learn contextual embeddings of text from software platforms. Our experimental findings demonstrate that bLLMs exhibit state-of-the-art performance on datasets marked by limited training data and imbalanced distributions. bLLMs can also achieve excellent performance under a zero-shot setting. However, when ample training data is available or the dataset exhibits a more balanced distribution, fine-tuned sLLMs can still achieve superior results.

Paper Structure

This paper contains 21 sections, 7 equations, 7 figures, 8 tables.

Introduction
Related Work
Boosting SA4SE Accuracy
Empirical Studies in SA4SE
Experimental Setup
Research Questions
Method
Dataset
Evaluated Language Models
Evaluation Metrics
Implementation Details
Results
RQ1: Impact of different prompts on the performance of bLLMs with zero-shot learning
RQ2: Impact of different shots on the performance of bLLMs with few-shot learning
RQ3: Comparison between fine-tuned sLLMs and bLLMs
...and 6 more sections

Figures (7)

Figure 1: The zero-shot prompt templates we utilized when running Vicuna [vicuna2023] and WizardLM [xu2023wizardlm].
Figure 2: Few-shot prompt template (with $k=1$) utilized by Llama 2-Chat [touvron2023llama2].
Figure 3: One example to get the prediction probability scores from the bLLMs.
Figure 4: Sensitivity of different prompt designs. The circles depicted in the figure represent outlier data points.
Figure 5: Comparison of the highest macro-F1 and micro-F1 scores achieved through zero-shot learning and few-shot learning.
...and 2 more figures
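
The captions above refer to macro-F1 and micro-F1 scores. As a minimal sketch of how the two can diverge on an imbalanced sentiment dataset, the following computes both from scratch. It assumes the standard definitions of these metrics (the paper's exact formulas are not reproduced in this section), and the function name `f1_scores` and the toy labels are hypothetical, not taken from the paper's data.

```python
from collections import Counter


def f1_scores(y_true, y_pred):
    """Return (macro_f1, micro_f1) for single-label, multi-class predictions.

    Assumes the standard definitions: macro-F1 is the unweighted mean of
    per-class F1, micro-F1 pools the counts over all classes first.
    """
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class gains a false positive
            fn[t] += 1  # true class gains a false negative

    def f1(tp_, fp_, fn_):
        denom = 2 * tp_ + fp_ + fn_
        return 2 * tp_ / denom if denom else 0.0

    macro = sum(f1(tp[l], fp[l], fn[l]) for l in labels) / len(labels)
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    return macro, micro


# Toy, made-up example: an imbalanced set where the model always says "neutral".
y_true = ["neutral"] * 8 + ["positive", "negative"]
y_pred = ["neutral"] * 10
macro, micro = f1_scores(y_true, y_pred)
print(f"macro-F1 = {macro:.3f}, micro-F1 = {micro:.3f}")
```

In this toy case micro-F1 stays high because the majority class dominates the pooled counts, while macro-F1 drops because the two minority classes each score zero. This is one reason both metrics are commonly reported together for imbalanced datasets.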
