How Effectively Do LLMs Extract Feature-Sentiment Pairs from App Reviews?

Faiz Ali Shah; Ahmed Sabir; Rajesh Sharma; Dietmar Pfahl

TL;DR

This paper investigates how effectively large language models can extract feature–sentiment pairs from app reviews, comparing zero-shot and few-shot prompting across proprietary (GPT-4, ChatGPT) and open-source (Llama-2) models against rule-based and supervised baselines. The authors use a labeled dataset of 1000 reviews over 8 apps to evaluate feature extraction and feature-specific sentiment prediction, employing exact and partial matching. Key findings show that GPT-4 is the strongest zero-shot performer for feature extraction among the LLMs, while a fine-tuned RE-BERT can outperform it in some settings; few-shot prompting further improves results, with GPT-4 achieving notable gains in positive and neutral sentiment predictions. The results inform both research directions and practical prompting strategies, highlighting the potential and current limits of LLMs for fine-grained analysis of user feedback in software maintenance and evolution.

Abstract

Automatic analysis of user reviews to understand user sentiments toward app functionality (i.e., app features) helps align development efforts with user expectations and needs. Recent Large Language Models (LLMs) such as ChatGPT have shown impressive performance on several new tasks without updating the model's parameters, i.e., using zero or a few labeled examples, but the capabilities of LLMs remain unexplored for feature-specific sentiment analysis. The goal of our study is to explore the capabilities of LLMs to perform feature-specific sentiment analysis of user reviews. This study compares the performance of state-of-the-art LLMs, including GPT-4, ChatGPT, and different variants of Llama-2 chat, against previous approaches for extracting app features and associated sentiments in zero-shot, 1-shot, and 5-shot scenarios. The results indicate that GPT-4 outperforms the rule-based SAFE by 17% in F1-score for extracting app features in the zero-shot scenario, with 5-shot further improving it by 6%. However, the fine-tuned RE-BERT exceeds GPT-4 by 6% in F1-score. For predicting positive and neutral sentiments, GPT-4 achieves F1-scores of 76% and 45% in the zero-shot setting, which improve by 7% and 23% in the 5-shot setting, respectively. Our study conducts a thorough evaluation of both proprietary and open-source LLMs to provide an objective assessment of their performance in extracting feature-sentiment pairs.
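
To make the reported F1-scores concrete, the following is a minimal sketch of how precision, recall, and F1 could be computed for extracted app features under exact and partial matching. It is not the authors' evaluation code. The partial-match rule used here is an assumption: two features match if one is a word subset of the other and their lengths differ by at most `n` words. The function names `is_match` and `prf1` and the example features are hypothetical.

```python
def tokens(feature: str) -> list[str]:
    """Lowercase and split a feature phrase into words."""
    return feature.lower().split()


def is_match(pred: str, gold: str, n: int = 0) -> bool:
    """Exact match when n == 0; otherwise an ASSUMED partial-match rule:
    the shorter phrase's words all occur in the longer phrase, and the
    two phrases differ in length by at most n words."""
    p, g = tokens(pred), tokens(gold)
    if p == g:
        return True
    if n == 0:
        return False
    short, long_ = sorted((p, g), key=len)
    if not short or len(long_) - len(short) > n:
        return False
    return all(word in long_ for word in short)


def prf1(predicted: list[str], gold: list[str], n: int = 0):
    """Return (precision, recall, F1) for one review's feature lists.
    Each gold feature can be matched by at most one prediction."""
    unmatched = list(gold)
    tp = 0
    for pred in predicted:
        for g in unmatched:
            if is_match(pred, g, n):
                unmatched.remove(g)  # consume the gold feature
                tp += 1
                break
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical example: one prediction is shorter than the gold feature.
predicted = ["send photos", "dark mode"]
gold = ["send photos to friends", "dark mode"]
print(prf1(predicted, gold, n=0))  # exact matching
print(prf1(predicted, gold, n=2))  # partial matching with n = 2
```

Under this assumed rule, the shortened prediction counts as an error with exact matching but as a hit with partial matching at n = 2, which is why partial-match scores are typically at least as high as exact-match scores.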

Paper Structure (25 sections, 4 figures, 7 tables)

Introduction
Background
Related work
Rule-based methods
Supervised Learning
LLMs with RLHF
Experimental Settings
Labeled dataset
Experimented LLMs
Baselines
Evaluation procedure
Implementation Details
Results
RQ1 - Comparison of zero-shot LLM performance and baseline methods for extracting feature-sentiment pairs
Feature extraction performance
...and 10 more sections

Figures (4)

Figure 1: Our approach for evaluating zero-shot and few-shot capabilities of LLMs for extracting (app feature, sentiment) pairs from a user review.
Figure 2: Comparison of zero-shot performance of LLMs in predicting sentiment with the S-prompt (upper plot) and the L-prompt (lower plot), using partial match $(n=2)$. All models except Llama-7B perform better with the L-prompt when predicting neutral sentiment.
Figure 3: Comparison of zero-shot, 1-shot, and 5-shot performance of GPT-4, ChatGPT, and Llama-70B in predicting feature-specific sentiment. With 5-shot, the F1-score for predicting positive sentiment increases by 7% for GPT-4 and 3% for Llama-70B. For neutral sentiment, the F1-scores of GPT-4 and Llama-70B improve by 23% and 14%, respectively, with 5-shot.
Figure 4: Comparison of the 5-shot F1 performance of GPT-4, ChatGPT, and Llama-2-70B Chat for each individual app.
