What About Emotions? Guiding Fine-Grained Emotion Extraction from Mobile App Reviews

Quim Motger, Marc Oriol, Max Tiessler, Xavier Franch, Jordi Marco

TL;DR

This paper tackles the lack of fine-grained emotion understanding in mobile app reviews by adapting Plutchik’s eight-emotion taxonomy to this domain. It develops structured annotation guidelines and creates an annotated dataset of 1,112 sentences across 257 apps, enabling multi-label emotion classification and rigorous agreement analysis. The study also evaluates the feasibility of using large language models (LLMs) for automated emotion annotation, comparing them to human labels in terms of agreement and cost, and highlights practical trade-offs and design considerations for integrating emotion analysis into requirements engineering. The findings show that LLMs can greatly reduce manual effort while achieving substantial but imperfect agreement with human annotations, underscoring the value of human-in-the-loop approaches and multi-model ensembles for robust emotion extraction in app reviews. Overall, the work provides replicable artifacts, guidance for adapting emotion taxonomies to software feedback, and concrete insights for building semi-automatic pipelines to enhance feature-emotion analysis, release planning, and issue triaging in software engineering contexts.

Abstract

Opinion mining plays a vital role in analysing user feedback and extracting insights from textual data. While most research focuses on sentiment polarity (e.g., positive, negative, neutral), fine-grained emotion classification in app reviews remains underexplored. Such classification is needed to better understand users' affective responses and to support downstream tasks such as feature-emotion analysis, user-oriented release planning, and issue triaging. This paper addresses this gap by identifying and tackling the challenges and limitations of fine-grained emotion analysis in the context of app reviews. Our study adapts Plutchik's emotion taxonomy to app reviews by developing a structured annotation framework and dataset. Through an iterative human annotation process, we define clear annotation guidelines and document key challenges in emotion classification. Additionally, we evaluate the feasibility of automating emotion annotation using large language models, assessing their cost-effectiveness and agreement with human-labelled data. Our findings reveal that while large language models significantly reduce manual effort and maintain substantial agreement with human annotators, full automation remains challenging due to the complexity of emotional interpretation. This work contributes to opinion mining in requirements engineering by providing structured guidelines, an annotated dataset, and insights for developing automated pipelines to capture the complexity of emotions in app reviews.
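
As a rough illustration of the kind of agreement analysis described above, the following Python sketch computes Cohen's kappa per emotion between two annotators (for example, a human and an LLM) on multi-label data. It is not the authors' evaluation code: the function names, the set-based label representation, and the convention of returning 1.0 when kappa is undefined are all assumptions made for illustration.

```python
from typing import Dict, List, Set

# Plutchik's eight basic emotions.
EMOTIONS = ["joy", "trust", "fear", "surprise",
            "sadness", "disgust", "anger", "anticipation"]


def cohen_kappa(a: List[int], b: List[int]) -> float:
    """Cohen's kappa for two equally long lists of binary (0/1) labels."""
    if len(a) != len(b) or not a:
        raise ValueError("label lists must be non-empty and equally long")
    n = len(a)
    # Observed agreement: fraction of items labelled identically.
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement derived from each annotator's marginal label rates.
    rate_a, rate_b = sum(a) / n, sum(b) / n
    expected = rate_a * rate_b + (1 - rate_a) * (1 - rate_b)
    if expected == 1.0:
        # Both annotators used the same single label everywhere, so kappa is
        # undefined; returning 1.0 here is an assumed convention.
        return 1.0
    return (observed - expected) / (1 - expected)


def per_emotion_kappa(first: List[Set[str]],
                      second: List[Set[str]]) -> Dict[str, float]:
    """Treat each emotion as a separate binary task and score agreement."""
    scores = {}
    for emotion in EMOTIONS:
        a = [int(emotion in labels) for labels in first]
        b = [int(emotion in labels) for labels in second]
        scores[emotion] = cohen_kappa(a, b)
    return scores


# Hypothetical annotations for four review sentences (not from the dataset).
human = [{"joy"}, {"anger"}, {"joy", "trust"}, {"sadness"}]
llm = [{"joy"}, {"anger", "disgust"}, {"trust"}, {"sadness"}]

kappas = per_emotion_kappa(human, llm)
average = sum(kappas.values()) / len(kappas)
```

Scoring each emotion as its own binary task is one simple way to handle multi-label sentences; averaging the per-emotion values gives a single summary figure, though other aggregation choices are equally plausible.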

Paper Structure

This paper contains 31 sections, 6 figures, and 6 tables.

Figures (6)

  • Figure 1: Research method.
  • Figure 2: Results from the literature review.
  • Figure 3: Distribution of emotions in the literature review (including only those that appear in more than one study).
  • Figure 4: Evolution of the average Cohen's Kappa agreement across iterations.
  • Figure 5: Evaluation of human agreement vs. LLM-based agreement.
  • ...and 1 more figure