KazSAnDRA: Kazakh Sentiment Analysis Dataset of Reviews and Attitudes

Rustem Yeshpanov, Huseyin Atakan Varol

TL;DR

KazSAnDRA is presented as the first and largest publicly available Kazakh sentiment analysis dataset, comprising 180,064 reviews with ratings from 1 to 5 across four domains. The authors document dataset construction end to end: data collection, preprocessing, and handling of class imbalance via random oversampling (ROS) and random undersampling (RUS). They define two tasks, polarity classification and score classification, and compare training on balanced and imbalanced data. Four multilingual transformers (mBERT, XLM-R, RemBERT, mBART-50) are fine-tuned and evaluated, with XLM-R and RemBERT achieving the best test $F_1$-scores of $0.81$ (polarity) and $0.39$ (score). The results reveal domain and data-quality effects on performance, especially for the more granular score task. The dataset and fine-tuned models are openly available under CC BY 4.0 on GitHub. The work also highlights challenges characteristic of Kazakh text, such as spelling inconsistencies and code-switching, and points to future enhancements such as back-translation and standardized annotation guidelines.
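
To make the imbalance-handling step concrete, here is a minimal sketch of random oversampling (ROS) and random undersampling (RUS) on a toy set of rated reviews. It is not the authors' implementation: the function name `resample`, the placeholder review strings, and the toy rating distribution are all assumptions made for illustration, and the summary above does not specify how the resampling was configured.

```python
import random
from collections import Counter, defaultdict


def resample(samples, labels, mode, seed=0):
    """Balance classes by random resampling (hypothetical helper).

    mode="ros": duplicate randomly chosen examples of smaller classes
                until every class matches the largest class.
    mode="rus": randomly drop examples of larger classes until every
                class matches the smallest class.
    """
    if mode not in ("ros", "rus"):
        raise ValueError("mode must be 'ros' or 'rus'")
    rng = random.Random(seed)

    # Group the examples by class label.
    by_class = defaultdict(list)
    for sample, label in zip(samples, labels):
        by_class[label].append(sample)

    sizes = [len(items) for items in by_class.values()]
    target = max(sizes) if mode == "ros" else min(sizes)

    balanced = []
    for label, items in by_class.items():
        if mode == "ros":
            # Sample with replacement to fill the gap to the target size.
            extra = [rng.choice(items) for _ in range(target - len(items))]
            chosen = items + extra
        else:
            # Sample without replacement down to the target size.
            chosen = rng.sample(items, target)
        balanced.extend((sample, label) for sample in chosen)

    rng.shuffle(balanced)
    return balanced


# Toy data (assumed, not from the dataset): two 1-star and six 5-star reviews.
reviews = [f"review_{i}" for i in range(8)]
ratings = [1, 1, 5, 5, 5, 5, 5, 5]

ros = resample(reviews, ratings, "ros")
rus = resample(reviews, ratings, "rus")
ros_counts = Counter(label for _, label in ros)
rus_counts = Counter(label for _, label in rus)
print(ros_counts, rus_counts)
```

In common practice, resampling of this kind is applied to the training split only, so that validation and test scores still reflect the original class distribution.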

Abstract

This paper presents KazSAnDRA, a dataset developed for Kazakh sentiment analysis that is the first and largest publicly available dataset of its kind. KazSAnDRA comprises an extensive collection of 180,064 reviews obtained from various sources and includes numerical ratings ranging from 1 to 5, providing a quantitative representation of customer attitudes. The study also pursued the automation of Kazakh sentiment classification through the development and evaluation of four machine learning models trained for both polarity classification and score classification. Experimental analysis included evaluation of the results considering both balanced and imbalanced scenarios. The most successful model attained an F1-score of 0.81 for polarity classification and 0.39 for score classification on the test sets. The dataset and fine-tuned models are open access and available for download under the Creative Commons Attribution 4.0 International License (CC BY 4.0) through our GitHub repository.

Paper Structure (20 sections, 1 figure, 14 tables)