Sentiment Analysis in Software Engineering: Evaluating Generative Pre-trained Transformers
KM Khalid Saifullah, Faiaz Azmain, Habiba Hye
TL;DR
This paper addresses sentiment analysis in software engineering (SE) by directly comparing bidirectional transformers (BERT) with generative transformers (GPT-4o-mini) across three SE datasets: GitHub, Stack Overflow, and Jira. It evaluates both fine-tuned and default GPT-4o-mini configurations, revealing that fine-tuned GPT-4o-mini matches or surpasses BERT on balanced datasets (GitHub, Jira), while the default GPT-4o-mini demonstrates stronger generalization on the linguistically complex, imbalanced Stack Overflow data. The results highlight important trade-offs between fine-tuning and leveraging pre-trained generative models, emphasizing that dataset characteristics drive model choice and performance. The study provides practical guidance for SE sentiment tooling and suggests directions for improving domain-specific sentiment analysis using transformer architectures.
Abstract
Sentiment analysis plays a crucial role in understanding developer interactions, issue resolutions, and project dynamics within software engineering (SE). While traditional SE-specific sentiment analysis tools have made significant strides, they often fail to account for the nuanced and context-dependent language inherent to the domain. This study systematically evaluates the performance of bidirectional transformers, such as BERT, against generative pre-trained transformers, specifically GPT-4o-mini, in SE sentiment analysis. Using datasets from GitHub, Stack Overflow, and Jira, we benchmark the models' capabilities with fine-tuned and default configurations. The results reveal that fine-tuned GPT-4o-mini performs comparably to BERT and other bidirectional models on structured and balanced datasets like GitHub and Jira, achieving macro-averaged F1-scores of 0.93 and 0.98, respectively. However, on linguistically complex datasets with imbalanced sentiment distributions, such as Stack Overflow, the default GPT-4o-mini model exhibits superior generalization, achieving an accuracy of 85.3% compared to the fine-tuned model's 13.1%. These findings highlight the trade-offs between fine-tuning and leveraging pre-trained models for SE tasks. The study underscores the importance of aligning model architectures with dataset characteristics to optimize performance and proposes directions for future research in refining sentiment analysis tools tailored to the SE domain.
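
The abstract reports both macro-averaged F1 and accuracy, and the two can diverge sharply on imbalanced label distributions. The following is a minimal sketch of how each metric is computed, using the standard definitions rather than the authors' evaluation code. The function names and the toy labels are hypothetical and are not drawn from the paper's datasets.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over every label seen in either list."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class is never
        # predicted correctly and the denominator is empty.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy, made-up example (not from the paper): an imbalanced label set where
# a classifier predicts the majority class for every input.
y_true = ["neutral"] * 8 + ["positive", "negative"]
y_pred = ["neutral"] * 10

acc = accuracy(y_true, y_pred)
f1 = macro_f1(y_true, y_pred)
print(f"accuracy={acc:.3f} macro_f1={f1:.3f}")
```

In this toy case the majority-class predictor scores high on accuracy but low on macro-averaged F1, because the two minority classes each contribute an F1 of zero to the unweighted mean. This is why both metrics are worth reading together on imbalanced data such as the Stack Overflow set.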
