Fine-tuning multilingual language models in Twitter/X sentiment analysis: a study on Eastern-European V4 languages
Tomáš Filip, Martin Pavlíček, Petr Sosík
TL;DR
This paper tackles aspect-based sentiment analysis on Twitter/X data in underrepresented Eastern European languages (the V4 group) during the Russia–Ukraine conflict. It evaluates multiple mid-to-large language models (BERT, BERTweet, Llama2/3, Mistral) with PEFT fine-tuning on thousands of multilingual tweets, incorporating English translations (DeepL preferred) and using GPT-4 as a non-fine-tuned reference. The results show that fine-tuning can reach near state-of-the-art performance for this narrow, multilingual task, with Llama2 and Mistral delivering the best macro-F1 scores (~72.8) across targets and languages, while translation to English generally improves results and Polish remains particularly challenging. The findings underscore the practicality of cost-efficient fine-tuning for ABSA in low-resource languages, highlight model- and task-dependent variability, and point to future work in bias analysis, knowledge distillation, and integrated data platforms for cyberspace sentiment monitoring.
Abstract
Aspect-based sentiment analysis (ABSA) is a standard NLP task with numerous approaches and benchmarks, where large language models (LLMs) represent the current state of the art. We focus on ABSA subtasks based on Twitter/X data in underrepresented languages. On such narrow tasks, small fine-tuned language models can often outperform universal large ones, providing accessible and inexpensive solutions. We fine-tune several LLMs (BERT, BERTweet, Llama2, Llama3, Mistral) to classify sentiment towards Russia and Ukraine in the context of the ongoing military conflict. The training/testing dataset was obtained through the Twitter/X academic API during 2023 and narrowed to the languages of the V4 countries (Czech Republic, Slovakia, Poland, Hungary). We then measure the models' performance under a variety of settings, including translations, sentiment targets, and in-context learning, using GPT-4 as a reference model. We document several interesting phenomena, demonstrating, among other things, that some models are far more amenable to fine-tuning on multilingual Twitter tasks than others, and that they can reach the SOTA level with a very small training set. Finally, we identify the combinations of settings that provide the best results.
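Model comparisons in this study are summarized with macro-F1 scores. As a minimal sketch of how such a score can be computed for a single sentiment target, the snippet below takes the unweighted mean of per-class F1 values. The three-class label set (negative, neutral, positive), the function name `macro_f1`, and the toy predictions are illustrative assumptions, not the paper's data or evaluation code. The result is on a 0 to 1 scale; multiply by 100 for a percentage.

```python
from collections import Counter


def macro_f1(y_true, y_pred, labels):
    """Return the unweighted mean of per-class F1 scores (0 to 1 scale)."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gains a false positive
            fn[truth] += 1  # true class gains a false negative
    scores = []
    for label in labels:
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the class never occurs
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical labels for six tweets (not taken from the paper's dataset)
LABELS = ["negative", "neutral", "positive"]
y_true = ["negative", "neutral", "positive", "positive", "negative", "neutral"]
y_pred = ["negative", "positive", "positive", "positive", "neutral", "neutral"]

score = macro_f1(y_true, y_pred, LABELS)
print(f"macro-F1: {score:.4f}")
```

Because every class contributes equally to the average, a model that neglects a rare sentiment class is penalized more heavily than it would be under plain accuracy.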
