Evaluating the Effectiveness of Pre-trained Language Models in Predicting the Helpfulness of Online Product Reviews
Ali Boluki, Javad Pourmostafa Roshan Sharami, Dimitar Shterionov
TL;DR
This work addresses predicting the helpfulness of online product reviews before any votes are observed, which is needed to rank reviews in high-volume marketplaces. It systematically compares the monolingual RoBERTa and multilingual XLM-R pre-trained transformers against a LIWC-based Random Forest baseline on Amazon reviews across four product categories, and analyzes supplementary features such as star rating and review length. The study finds that pre-trained transformers substantially outperform the baseline, with RoBERTa typically outperforming XLM-R and the additional features providing further gains; shallow fine-tuning works best, often of only the top 2 layers. The results support scalable, data-efficient ranking of helpful reviews and suggest that multilingual models may not offer a clear advantage for English-only data, while paving the way for future multilingual evaluations and testing on broader datasets.
Abstract
Businesses and customers can gain valuable information from product reviews. The sheer number of reviews often necessitates ranking them by their potential helpfulness. However, only a few reviews ever receive any helpfulness votes on online marketplaces. Sorting all reviews by the few existing votes can cause helpful reviews to go unnoticed because of the limited attention span of readers. The problem of review helpfulness prediction becomes even more important at higher review volumes and for newly written reviews or newly launched products. In this work, we compare the use of the RoBERTa and XLM-R language models to predict the helpfulness of online product reviews. The contributions of our work in relation to the literature include extensively investigating the efficacy of state-of-the-art language models (both monolingual and multilingual) against a robust baseline, taking ranking metrics into account when assessing these approaches, and assessing multilingual models for the first time. We employ the Amazon review dataset for our experiments. According to our study on several product categories, multilingual and monolingual pre-trained language models outperform a baseline that uses a random forest with handcrafted features by as much as 23% in RMSE. Pre-trained language models reduce the need for complex text feature engineering. However, our results suggest that pre-trained multilingual models may not be the best choice when fine-tuning on only one language. We assess the performance of language models with and without additional features. Our results show that including additional features, such as the product rating given by the reviewer, can further help the predictive methods.
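To make the RMSE comparison concrete, the following is a minimal sketch of how a relative RMSE improvement over a baseline can be computed. It assumes that the reported 23% is a relative reduction in RMSE with respect to the baseline, which the abstract does not spell out. The function names and the toy scores are hypothetical and are not taken from our experiments.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def relative_rmse_reduction(baseline_rmse, model_rmse):
    """Fraction by which model_rmse is lower than baseline_rmse (assumed definition)."""
    return (baseline_rmse - model_rmse) / baseline_rmse


# Hypothetical helpfulness scores, for illustration only (not data from the paper).
y_true = [0.0, 1.0, 2.0, 3.0]
baseline_pred = [1.0, 2.0, 3.0, 4.0]  # every prediction is off by 1.0
model_pred = [0.5, 1.5, 2.5, 3.5]     # every prediction is off by 0.5

baseline_rmse = rmse(y_true, baseline_pred)
model_rmse = rmse(y_true, model_pred)
reduction = relative_rmse_reduction(baseline_rmse, model_rmse)
print(f"baseline RMSE={baseline_rmse}, model RMSE={model_rmse}, reduction={reduction:.0%}")
```

In this toy case the model halves the baseline error, so the relative reduction is 50%. A 23% reduction would correspond to a model RMSE of 0.77 times the baseline RMSE under the same assumed definition.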
