Fine-tuning multilingual language models in Twitter/X sentiment analysis: a study on Eastern-European V4 languages

Tomáš Filip, Martin Pavlíček, Petr Sosík

TL;DR

This paper tackles aspect-based sentiment analysis on Twitter/X data in underrepresented Eastern European languages (the V4 group) during the Russia–Ukraine conflict. It evaluates multiple mid-to-large language models (BERT, BERTweet, Llama2/3, Mistral) with PEFT fine-tuning on thousands of multilingual tweets, incorporating English translations (DeepL preferred) and using GPT-4 as a non-fine-tuned reference. The results show that fine-tuning can reach near state-of-the-art performance for this narrow, multilingual task, with Llama2 and Mistral delivering the best macro-F1 scores (~72.8) across targets and languages, while translation to English generally improves results and Polish remains particularly challenging. The findings underscore the practicality of cost-efficient fine-tuning for ABSA in low-resource languages, highlight model- and task-dependent variability, and point to future work in bias analysis, knowledge distillation, and integrated data platforms for cyberspace sentiment monitoring.

Abstract

Aspect-based sentiment analysis (ABSA) is a standard NLP task with numerous approaches and benchmarks, where large language models (LLMs) represent the current state of the art. We focus on ABSA subtasks based on Twitter/X data in underrepresented languages. On such narrow tasks, small tuned language models can often outperform universal large ones, providing accessible and cheap solutions. We fine-tune several LLMs (BERT, BERTweet, Llama2, Llama3, Mistral) for classification of sentiment towards Russia and Ukraine in the context of the ongoing military conflict. The training/testing dataset was obtained through the Twitter/X academic API during 2023 and narrowed to the languages of the V4 countries (Czech Republic, Slovakia, Poland, Hungary). We then measure the models' performance under a variety of settings, including translations, sentiment targets, in-context learning and more, using GPT-4 as a reference model. We document several interesting phenomena demonstrating, among other things, that some models are much more amenable to fine-tuning on multilingual Twitter tasks than others, and that they can reach the SOTA level with a very small training set. Finally, we identify combinations of settings providing the best results.

Paper Structure (8 sections, 4 figures, 6 tables)

This paper contains 8 sections, 4 figures, 6 tables.

Figures (4)

  • Figure 1: Illustration of the experimental pipeline. The downloaded dataset was split into three language-specific parts, which were then annotated. Three translation variants (Helsinki, DeepL, none) were prepared, yielding 9 individual datasets. Finally, the models were fine-tuned and tested. Experiments were run in four variants, combining classification into two/three classes and training with/without reference tweets.
  • Figure 2: Macro-averaged F1-score by language models and translation, as listed in Table \ref{tab:avg-model-target-translator}.
  • Figure 3: Macro-averaged F1-score by models and languages of tweets, averaged over all types of translation, both aspects and (non)use of the reference tweet, i.e., each score is an average of 12 experiments.
  • Figure 4: Macro-averaged F1-score by language models and translation for two-valued classification (positive/negative), averaged over all types of translation, both aspects and (non)use of the reference tweet, i.e., each score is an average of 12 experiments.
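
The counts quoted in the captions above (9 datasets, 12 experiments per averaged score) and the macro-averaged F1 metric can be illustrated with a short sketch. It is an illustration only: the label strings, the grouping of the three language-specific parts, and the `macro_f1` helper are assumptions introduced here, not code or identifiers taken from the paper.

```python
from itertools import product

# All labels below are hypothetical placeholders chosen for this sketch.
LANGUAGE_PARTS = ["cs_sk", "pl", "hu"]         # assumed grouping of the three language-specific parts
TRANSLATIONS = ["none", "helsinki", "deepl"]   # the three translation variants named in Figure 1
ASPECTS = ["russia", "ukraine"]                # the two sentiment targets
REFERENCE_TWEET = [False, True]                # training without / with the reference tweet


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all classes seen in either list."""
    classes = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in classes:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == c and p == c)
        fp = sum(1 for t, p in pairs if t != c and p == c)
        fn = sum(1 for t, p in pairs if t == c and p != c)
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the class never occurs.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Language part x translation variant gives the individual datasets of Figure 1.
datasets = list(product(LANGUAGE_PARTS, TRANSLATIONS))

# Translation x aspect x reference-tweet use gives the experiments that are
# averaged into one (model, language) score in Figure 3.
experiments_per_score = list(product(TRANSLATIONS, ASPECTS, REFERENCE_TWEET))

if __name__ == "__main__":
    print(len(datasets), "datasets")
    print(len(experiments_per_score), "experiments per averaged score")
    # Toy three-class example with made-up labels, not data from the study.
    gold = ["pos", "neg", "neu", "pos"]
    pred = ["pos", "neg", "pos", "pos"]
    print("macro-F1:", macro_f1(gold, pred))
```

Macro-averaging weights every class equally regardless of how many tweets it contains, so a model that ignores a rare class is penalised more heavily than it would be under accuracy or micro-averaged F1.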