Fine-tuning multilingual language models in Twitter/X sentiment analysis: a study on Eastern-European V4 languages

Tomáš Filip; Martin Pavlíček; Petr Sosík

TL;DR

This paper tackles aspect-based sentiment analysis on Twitter/X data in underrepresented Eastern European languages (the V4 group) during the Russia–Ukraine conflict. It evaluates multiple mid-to-large language models (BERT, BERTweet, Llama2/3, Mistral) with PEFT fine-tuning on thousands of multilingual tweets, incorporating English translations (DeepL preferred) and using GPT-4 as a non-fine-tuned reference. The results show that fine-tuning can reach near state-of-the-art performance for this narrow, multilingual task, with Llama2 and Mistral delivering the best macro-F1 scores (~72.8) across targets and languages, while translation to English generally improves results and Polish remains particularly challenging. The findings underscore the practicality of cost-efficient fine-tuning for ABSA in low-resource languages, highlight model- and task-dependent variability, and point to future work in bias analysis, knowledge distillation, and integrated data platforms for cyberspace sentiment monitoring.

Abstract

Aspect-based sentiment analysis (ABSA) is a standard NLP task with numerous approaches and benchmarks, where large language models (LLMs) represent the current state of the art. We focus on ABSA subtasks based on Twitter/X data in underrepresented languages. On such narrow tasks, small fine-tuned language models can often outperform universal large ones, providing accessible and inexpensive solutions. We fine-tune several LLMs (BERT, BERTweet, Llama2, Llama3, Mistral) to classify sentiment towards Russia and Ukraine in the context of the ongoing military conflict. The training/testing dataset was obtained through the Twitter/X academic API during 2023 and narrowed to the languages of the V4 countries (Czech Republic, Slovakia, Poland, Hungary). We then measure the models' performance under a variety of settings, including translations, sentiment targets, in-context learning and more, using GPT-4 as a reference model. We document several interesting phenomena demonstrating, among others, that some models are much more amenable to fine-tuning on multilingual Twitter tasks than others, and that they can reach the SOTA level with a very small training set. Finally, we identify the combinations of settings that provide the best results.

Paper Structure (8 sections, 4 figures, 6 tables)

This paper contains 8 sections, 4 figures, 6 tables.

Introduction
Background
Methods
Results
Conclusion
Selected experimental results
Prompt for in-context learning
Examples of misclassified tweets

Figures (4)

Figure 1: Illustration of the experimental pipeline. The downloaded dataset was split into three language-specific parts, which were annotated. Three translation variants (Helsinki, DeepL, none) were prepared, yielding 9 individual datasets. Finally, the models were fine-tuned and tested. Experiments were run in four variants, combining classification into two/three classes and training with/without reference tweets.
Figure 2: Macro-averaged F1-score by language models and translation, as listed in Table \ref{tab:avg-model-target-translator}.
Figure 3: Macro-averaged F1-score by models and languages of tweets, averaged over all types of translation, both aspects and (non)use of the reference tweet, i.e., each score is an average of 12 experiments.
Figure 4: Macro-averaged F1-score by language models and translation for two-valued classification (positive/negative), averaged over all types of translation, both aspects and (non)use of the reference tweet, i.e., each score is an average of 12 experiments.
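
The counts quoted in the captions above (9 datasets, 12 experiments per averaged score) and the macro-averaged F1 metric can be checked with a short sketch. This is an illustration only, not the authors' code: the language-part labels are placeholders, the reading of "both aspects" as the two sentiment targets (Russia, Ukraine) is an assumption based on the abstract, and `macro_f1` is a plain-Python helper written for this example.

```python
from itertools import product

# Placeholder labels (hypothetical): the captions give only the counts,
# not the identifiers used in the actual experiments.
LANGUAGE_PARTS = ["part_1", "part_2", "part_3"]   # three language-specific parts
TRANSLATIONS = ["none", "Helsinki", "DeepL"]      # three translation variants
ASPECTS = ["Russia", "Ukraine"]                   # assumed: the two sentiment targets
REFERENCE_TWEET = [True, False]                   # with / without reference tweet

# Figure 1: each language part combined with each translation variant.
datasets = list(product(LANGUAGE_PARTS, TRANSLATIONS))

# Figures 3 and 4: experiments averaged into one reported score.
experiments_per_score = list(product(TRANSLATIONS, ASPECTS, REFERENCE_TWEET))


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores over all observed labels."""
    labels = sorted(set(y_true) | set(y_pred))
    per_class = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        # A class with no true or predicted instances contributes 0.
        per_class.append(2 * tp / denom if denom else 0.0)
    return sum(per_class) / len(per_class)


if __name__ == "__main__":
    print(len(datasets), len(experiments_per_score))
    print(macro_f1(["pos", "pos", "neg", "neg"], ["pos", "neg", "neg", "neg"]))
```

Because macro-averaging weights every class equally, a rarely occurring sentiment class influences the score as much as a frequent one, which matters when class frequencies in tweets are unbalanced.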
