Leveraging Encoder-only Large Language Models for Mobile App Review Feature Extraction
Quim Motger, Alessio Miaschi, Felice Dell'Orletta, Xavier Franch, Jordi Marco
TL;DR
This paper addresses the challenge of extracting user-visible features from noisy mobile app reviews by reframing feature extraction as a supervised token classification task using encoder-only LLMs. It introduces T-FREX, a baseline system, and two extensions: extended pre-training on a large corpus of app reviews, and an instance selection mechanism that reduces the training data while preserving quality, including a CDIS-based approach for NER-style training. Empirical evaluation across multiple encoder-only models shows that extended pre-training and instance selection can improve functional correctness (precision, recall, and F-beta score), training efficiency, or both. XLNet-large frequently provides the best in-domain performance, and base models benefit notably from data selection. The work provides ground-truth and extended datasets, publicly available models, and practical methods for integrating encoder-only LLMs into review-mining pipelines, offering actionable improvements for feature prioritization and sentiment analysis in mobile app contexts.
Abstract
Mobile app review analysis presents unique challenges due to the low quality, subjective bias, and noisy content of user-generated documents. Extracting features from these reviews is essential for tasks such as feature prioritization and sentiment analysis, but it remains a challenging task. Meanwhile, encoder-only models based on the Transformer architecture have shown promising results for classification and information extraction tasks for multiple software engineering processes. This study explores the hypothesis that encoder-only large language models can enhance feature extraction from mobile app reviews. By leveraging crowdsourced annotations from an industrial context, we redefine feature extraction as a supervised token classification task. Our approach includes extending the pre-training of these models with a large corpus of user reviews to improve contextual understanding and employing instance selection techniques to optimize model fine-tuning. Empirical evaluations demonstrate that this method improves the precision and recall of extracted features and enhances performance efficiency. Key contributions include a novel approach to feature extraction, annotated datasets, extended pre-trained models, and an instance selection mechanism for cost-effective fine-tuning. This research provides practical methods and empirical evidence in applying large language models to natural language processing tasks within mobile app reviews, offering improved performance in feature extraction.
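To make the token classification framing concrete, the sketch below shows how per-token labels could be turned back into feature mentions. It is a minimal illustration and not the authors' implementation: the BIO label names (`B-feature`, `I-feature`, `O`), the helper `extract_features`, and the example review are all assumptions chosen for illustration, and the labels are written by hand rather than predicted by a model.

```python
from typing import List


def extract_features(tokens: List[str], labels: List[str]) -> List[str]:
    """Group contiguous B-/I- tagged tokens into feature strings.

    Assumes a BIO scheme with the hypothetical labels
    "B-feature", "I-feature" and "O".
    """
    features: List[str] = []
    current: List[str] = []
    for token, label in zip(tokens, labels):
        if label == "B-feature":
            # A new feature starts; close any feature still open.
            if current:
                features.append(" ".join(current))
            current = [token]
        elif label == "I-feature" and current:
            # Continue the feature that is currently open.
            current.append(token)
        else:
            # "O" (or an "I-feature" with no open feature) ends the span.
            if current:
                features.append(" ".join(current))
            current = []
    if current:
        features.append(" ".join(current))
    return features


# Hand-labelled example review (illustrative, not taken from the datasets).
tokens = ["I", "love", "the", "offline", "maps", "but", "sharing", "is", "broken"]
labels = ["O", "O", "O", "B-feature", "I-feature", "O", "B-feature", "O", "O"]
print(extract_features(tokens, labels))  # ['offline maps', 'sharing']
```

In a full pipeline, the `labels` list would come from a fine-tuned encoder-only model that assigns one label to each token of the review.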
