A Cross-Validation Study of Turkish Sentiment Analysis Datasets and Tools
Şevval Çakıcı, Dilara Karaduman, Mehmet Akif Çırlan, Ali Hürriyetoğlu
TL;DR
The paper addresses fragmentation in Turkish sentiment analysis by performing a systematic literature review of 2012–2022 publications, compiling 23 datasets from 31 studies, and applying a Rodrigues2018-based taxonomy to map study characteristics. It benchmarks four state-of-the-art models (XLM-T, BERTurk with BounTi, TSAM, TurkishBERTweet) across diverse Turkish datasets, revealing that dataset properties and format alignment significantly influence performance. Key findings show strong binary-class performance for XLM-T and BERTurk, with TSAM performing exceptionally on the Humir dataset, while multi-class results are more dataset-dependent. The work provides a public data/resource hub and emphasizes the need for standardized Turkish sentiment benchmarks and robust dataset curation to enable meaningful cross-study comparisons.
Abstract
In recent years, sentiment analysis has gained increasing significance, prompting researchers to explore datasets in various languages, including Turkish. However, the limited availability of Turkish datasets has led to the same datasets being used in many different ways across studies, yielding divergent outcomes. To address this challenge, we conducted a rigorous review of research articles published between 2012 and 2022. We identified 31 studies and collected the 23 Turkish datasets used in them, obtained from publicly available sources and through email requests. We labeled these 31 studies using a taxonomy and provide a map of Turkish sentiment analysis datasets over this 10-year period according to that taxonomy. Moreover, we ran state-of-the-art sentiment analysis tools on these datasets and analyzed their performance across popular Turkish sentiment datasets. We observed that the performance of the sentiment analysis tools depends significantly on the characteristics of the target text. Our study fosters a more nuanced understanding of sentiment analysis in the Turkish language.
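As an illustration of the cross-dataset evaluation described above, the following is a minimal sketch of scoring several tools on several labeled datasets. It is not the authors' pipeline: the choice of macro-F1 as the metric is an assumption, and the tool functions (`keyword_tool`, `always_negative`) and toy datasets are hypothetical stand-ins for real models and corpora.

```python
def macro_f1(gold, pred):
    """Unweighted mean of per-class F1 over the labels that occur in gold."""
    labels = sorted(set(gold))
    scores = []
    for label in labels:
        tp = sum(1 for g, p in zip(gold, pred) if g == label and p == label)
        fp = sum(1 for g, p in zip(gold, pred) if g != label and p == label)
        fn = sum(1 for g, p in zip(gold, pred) if g == label and p != label)
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the denominator is 0.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def evaluate(tools, datasets):
    """Score every tool on every dataset; returns {(tool, dataset): macro-F1}."""
    results = {}
    for dataset_name, examples in datasets.items():
        texts = [text for text, _ in examples]
        gold = [label for _, label in examples]
        for tool_name, predict in tools.items():
            pred = [predict(text) for text in texts]
            results[(tool_name, dataset_name)] = macro_f1(gold, pred)
    return results


# Hypothetical stand-ins for real sentiment tools (assumption, not from the paper).
def keyword_tool(text):
    return "positive" if "güzel" in text else "negative"


def always_negative(text):
    return "negative"


# Hypothetical toy datasets: lists of (text, label) pairs.
datasets = {
    "toy_reviews": [("çok güzel bir film", "positive"),
                    ("berbat bir ürün", "negative")],
    "toy_tweets": [("bugün hava güzel", "positive"),
                   ("güzel değil", "negative")],
}
tools = {"keyword_tool": keyword_tool, "always_negative": always_negative}

results = evaluate(tools, datasets)
for (tool_name, dataset_name), score in sorted(results.items()):
    print(f"{tool_name:16s} {dataset_name:12s} macro-F1 = {score:.3f}")
```

Even in this toy setting, the same tool scores differently on the two datasets, which mirrors the observation that tool performance depends on the characteristics of the target text.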
