Polish-ASTE: Aspect-Sentiment Triplet Extraction Datasets for Polish
Marta Lango, Borys Naglik, Mateusz Lango, Iwo Naglik
TL;DR
This work addresses the scarcity of ASTE resources for Slavic languages by constructing two Polish ASTE datasets (hotels and products), drawn from the WCCRS corpus and annotated with aspect terms, opinion phrases, and sentiment polarity. It evaluates two ASTE methods, Grid Tagging Scheme (GTS) and Exploiting Phrase Interrelations Span-level Approach (EPISA), each combined with two Polish language models (TrelBERT and HerBERT), to gauge how well methods developed for English transfer to Polish. The results show that Polish ASTE remains more challenging than English: the combination of EPISA and TrelBERT delivers the best performance yet still lags behind English baselines, leaving substantial room for methodological advances. The datasets are released under a permissive licence in the same file format as the English datasets, which facilitates cross-language benchmarking, multi-domain experiments, and further research on under-resourced languages.
Abstract
Aspect-Sentiment Triplet Extraction (ASTE) is one of the most challenging and complex tasks in sentiment analysis. It concerns the construction of triplets that contain an aspect, its associated sentiment polarity, and an opinion phrase that serves as a rationale for the assigned polarity. Despite the growing popularity of the task and the many machine learning methods being proposed to address it, the number of datasets for ASTE is very limited. In particular, no dataset is available for any of the Slavic languages. In this paper, we present two new datasets for ASTE containing customer opinions about hotels and purchased products expressed in Polish. We also perform experiments with two ASTE techniques combined with two large language models for Polish to investigate their performance and the difficulty of the assembled datasets. The new datasets are available under a permissive licence and have the same file format as the English datasets, facilitating their use in future research.
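To make the triplet structure concrete, the following minimal sketch parses one annotated line into (aspect, opinion, polarity) triplets. It assumes the line layout commonly used by English ASTE benchmarks: a whitespace-tokenized sentence, the separator `####`, and a Python-literal list of (aspect token indices, opinion token indices, polarity label) tuples. The abstract only states that the Polish files share the English format, so the exact layout, the label names `POS` and `NEG`, the names `Triplet` and `parse_line`, and the example sentence are all illustrative assumptions rather than material taken from the datasets.

```python
import ast
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Triplet:
    """One ASTE triplet: aspect phrase, opinion phrase, sentiment polarity."""
    aspect: str
    opinion: str
    polarity: str


def parse_line(line: str) -> List[Triplet]:
    """Parse 'sentence####[([aspect_idx], [opinion_idx], label), ...]'.

    The layout is an assumption based on common English ASTE files.
    """
    sentence, _, raw_annotations = line.rstrip("\n").partition("####")
    tokens = sentence.split()
    triplets = []
    # literal_eval safely evaluates the list of tuples without running code.
    for aspect_idx, opinion_idx, polarity in ast.literal_eval(raw_annotations):
        triplets.append(
            Triplet(
                aspect=" ".join(tokens[i] for i in aspect_idx),
                opinion=" ".join(tokens[i] for i in opinion_idx),
                polarity=polarity,
            )
        )
    return triplets


# Invented example, not from the datasets.
# English gloss: "The room was clean, but the service unfriendly."
example = (
    "Pokój był czysty , ale obsługa niemiła ."
    "####[([0], [2], 'POS'), ([5], [6], 'NEG')]"
)

for triplet in parse_line(example):
    print(triplet)
```

Storing token indices rather than strings, as this layout does, keeps each aspect and opinion phrase tied to its position in the sentence, which matters when the same word occurs more than once.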
