An Optimized Machine Learning Classifier for Detecting Fake Reviews Using Extracted Features
Shabbir Anees, Anshuman, Ayush Chaurasia, Prathmesh Bogar
TL;DR
This study addresses the detection of computer-generated reviews in online marketplaces with a feature-based ML pipeline that combines a large set of multi-modal textual features, Harris Hawks Optimization (HHO) for feature selection, and a stacking ensemble classifier. The approach reduces the feature space from 13,539 to 1,368 features and achieves 95.40% accuracy and 0.992 AUC on a public CG-vs-human dataset, while also considering privacy for cloud-based deployments. Key contributions include a transparent feature-based framework, robustness assessed via cross-validation, and a discussion of privacy-preserving deployment strategies. The work is relevant to scalable, transparent fake-review detection on real-world platforms, and future work may improve performance with deep-learning embeddings.
Abstract
Fraudulent reviews undermine the legitimacy and dependability of online purchases. A recent development that misleads customers is the appearance of computer-generated (CG) reviews that pass as human-written ones. In this work, we present a machine-learning-based system that detects these AI-produced reviews with high accuracy. Our method integrates advanced text preprocessing, multi-modal feature extraction, Harris Hawks Optimization (HHO) for feature selection, and a stacking ensemble classifier. We implemented this methodology on a public dataset of 40,432 Original (OR) and Computer-Generated (CG) reviews. From an initial set of 13,539 features, HHO selected the 1,368 most relevant features, an 89.9% dimensionality reduction. Our final stacking model achieved 95.40% accuracy, 92.81% precision, 95.01% recall, and a 93.90% F1-Score, which indicates that combining ensemble learning with bio-inspired optimisation is an effective method for machine-generated text recognition. Because large-scale review analytics commonly run on cloud platforms, privacy-preserving techniques such as differential privacy and secure outsourcing are essential to protect user data in these systems.
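The reported dimensionality reduction and F1-Score follow directly from the other figures in the abstract. The short sketch below recomputes both as a consistency check; the function names are illustrative and are not taken from the authors' code, and the F1 formula is the standard harmonic mean of precision and recall.

```python
def reduction_pct(n_before: int, n_after: int) -> float:
    """Percentage of features removed by feature selection."""
    return 100.0 * (1.0 - n_after / n_before)


def f1_pct(precision_pct: float, recall_pct: float) -> float:
    """F1-Score (in percent) as the harmonic mean of precision and recall."""
    p = precision_pct / 100.0
    r = recall_pct / 100.0
    return 100.0 * (2.0 * p * r) / (p + r)


# Figures quoted in the abstract.
reduction = reduction_pct(13_539, 1_368)
f1 = f1_pct(92.81, 95.01)

print(f"Dimensionality reduction: {reduction:.1f}%")
print(f"F1-Score from precision and recall: {f1:.2f}%")
```

Both values agree with the abstract after rounding: about 89.9% for the reduction and about 93.90% for the F1-Score.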
