BanglishRev: A Large-Scale Bangla-English and Code-mixed Dataset of Product Reviews in E-Commerce

Mohammad Nazmush Shamael, Sabila Nawshin, Swakkhar Shatabda, Salekul Islam

TL;DR

This work introduces BanglishRev, the largest Bengali-focused e-commerce review dataset to date, encompassing 1.74 million reviews and 3.2 million ratings across 128,543 products from Daraz Bangladesh, with rich metadata and associated images. It investigates sentiment analysis by training BanglishBERT on rating-derived labels and evaluating on a manually annotated benchmark, achieving high performance with 0.94 accuracy and 0.94 F1. The dataset enables robust NLP for Bangla in multilingual and code-mixed contexts and supports broader analyses beyond sentiment, such as market and behavior studies, while acknowledging ethical considerations and computational challenges. The work provides a practical contribution to Bangla NLP and cross-lingual sentiment research, with the dataset publicly available on HuggingFace for further benchmarking and downstream tasks.

Abstract

This work presents the BanglishRev Dataset, the largest e-commerce product review dataset to date for reviews written in Bengali, English, a mixture of both, and Banglish (Bengali words written in the English alphabet). The dataset comprises 1.74 million written reviews drawn from 3.2 million ratings, collected from a total of 128k products sold on online e-commerce platforms targeting the Bengali population. It includes an extensive array of metadata for each review, including the rating given by the reviewer, the date the review was posted, the date of purchase, the number of likes and dislikes, the response from the seller, and the images associated with the review. Because sentiment analysis is the most prominent use of review datasets, a binary sentiment analysis experiment was conducted, with the review rating serving as an indicator of positive or negative sentiment, to evaluate how effective the large amount of data in BanglishRev is for sentiment analysis tasks. A BanglishBERT model is trained on data from BanglishRev, with reviews labeled positive if the rating is greater than 3 and negative if the rating is less than or equal to 3. The model is evaluated by testing it against a previously published, manually annotated dataset of e-commerce reviews written in a mixture of Bangla, English and Banglish. The experimental model achieved an accuracy of 94% and an F1 score of 0.94, demonstrating the dataset's efficacy for sentiment analysis. Some intriguing patterns and observations seen within the dataset, and future research directions in which the dataset can be utilized, are also discussed. The dataset can be accessed through https://huggingface.co/datasets/BanglishRev/bangla-english-and-code-mixed-ecommerce-review-dataset.
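
The rating-derived labeling rule stated in the abstract (positive if the rating is greater than 3, negative otherwise) can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `rating_to_label`, the 1-to-5 rating range check, the 1/0 label encoding, and the sample review strings are all assumptions made for the example.

```python
def rating_to_label(rating: int) -> int:
    """Map a star rating to a binary sentiment label.

    Rule from the abstract: rating > 3 is positive, rating <= 3 is negative.
    The 1..5 range and the 1/0 encoding are assumptions for this sketch.
    """
    if not 1 <= rating <= 5:
        raise ValueError(f"rating out of assumed 1-5 range: {rating}")
    return 1 if rating > 3 else 0


# Hypothetical (review text, rating) pairs, invented for illustration only.
samples = [
    ("product ta khub bhalo", 5),
    ("delivery was late and the box was damaged", 2),
    ("thik ache, not bad", 3),
]

# Attach a label to each review using the threshold rule.
labeled = [(text, rating_to_label(rating)) for text, rating in samples]

for text, label in labeled:
    print(f"{label}\t{text}")
```

Note that under this rule a neutral 3-star review falls into the negative class, so the negative label covers both neutral and dissatisfied reviews.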

Paper Structure

This paper contains 12 sections, 7 figures, 5 tables.

Figures (7)

  • Figure 1: Sample Banglish, Bangla and Code-mixed texts and their corresponding English translation
  • Figure 2: Density plot of product rating
  • Figure 3: Distribution of reviews by Language
  • Figure 4: Rating distribution per root category
  • Figure 5: Word count distribution
  • ...and 2 more figures