
Exploring Requirements Elicitation from App Store User Reviews Using Large Language Models

Tanmai Kumar Ghosh, Atharva Pargaonkar, Nasir U. Eisty

TL;DR

This study investigates automated requirements elicitation from app-store user reviews by fine-tuning three large language models (BERT, DistilBERT, and GEMMA) on a dataset of reviews labeled as useful or not useful for developers. All three models classify reviews well: BERT delivers the strongest overall accuracy, while GEMMA achieves the highest recall, so the models offer complementary strengths for gathering actionable insights. The work shows that data-driven, scalable requirements elicitation from unstructured feedback is viable. It also discusses threats to validity, including dataset size and language generalizability, and outlines future directions such as multilingual support and real-time monitoring. Overall, the findings suggest that LLM-based review analysis can streamline requirements elicitation and lead to more user-centric mobile applications.

Abstract

Mobile applications have become indispensable companions in our daily lives. Spanning categories from communication and entertainment to healthcare and finance, these applications influence every aspect of everyday life. Despite their omnipresence, developing apps that meet user needs and expectations remains a challenge. Traditional requirements elicitation methods such as user interviews can be time-consuming and suffer from limited scope and subjectivity. This research introduces an approach that uses Large Language Models (LLMs) to analyze user reviews for automated requirements elicitation. We fine-tuned three well-established LLMs (BERT, DistilBERT, and GEMMA) on a dataset of app reviews labeled for usefulness. In our evaluation, BERT performed best, achieving an accuracy of 92.40% and an F1-score of 92.39%, which demonstrates its effectiveness in classifying useful reviews. Although GEMMA displayed lower overall performance, it excelled in recall (93.39%), indicating its potential for capturing a comprehensive set of valuable user insights. These findings suggest that LLMs offer a promising avenue for streamlining requirements elicitation in mobile app development, leading to more user-centric and successful applications.
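
The abstract contrasts accuracy, F1-score, and recall, so a small worked example may help show how a classifier can lead on recall while trailing on overall accuracy. The sketch below is illustrative only: the function name `binary_metrics`, the toy labels, and the choice of unweighted binary metrics with "useful" as the positive class are assumptions, not the paper's evaluation code or data, and the paper's exact averaging scheme is not stated in this section.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall, and F1 for a binary task.

    Hypothetical helper for illustration; `positive` marks the 'useful' class.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)

    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Toy labels (made up): 1 = useful review, 0 = not useful.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
# A recall-oriented classifier: it finds every useful review
# but also flags two non-useful ones.
y_pred = [1, 1, 1, 1, 1, 1, 0, 0]

metrics = binary_metrics(y_true, y_pred)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

In this toy case every useful review is recovered, so recall is perfect, while the two false positives pull accuracy and precision down. That is the kind of trade-off the abstract describes for GEMMA relative to BERT.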

Paper Structure (22 sections, 4 figures, 2 tables)

Figures (4)

  • Figure 1: App Reviews distribution
  • Figure 2: Distribution of Useful and Non-Useful labels across Apps
  • Figure 3: Process Diagram
  • Figure 4: Performance Comparison