NLNDE at SemEval-2023 Task 12: Adaptive Pretraining and Source Language Selection for Low-Resource Multilingual Sentiment Analysis

Mingyang Wang, Heike Adel, Lukas Lange, Jannik Strötgen, Hinrich Schütze

TL;DR

This work tackles sentiment analysis for low-resource African languages using Twitter data from AfriSenti. It combines language- and task-adaptive pretraining on AfroXLM-R with explicit source-language selection for multilingual and zero-shot transfer. The approach yields strong results, achieving top performance in multiple tracks by maximizing positive transfer while avoiding interference from dissimilar languages. The findings underscore the value of targeted pretraining and data-driven source selection for practical NLP in resource-constrained multilingual settings, with implications for scalable sentiment analysis across underrepresented languages. Future work includes automating source selection and examining linguistic correlations among selected sources.

Abstract

This paper describes our system developed for SemEval-2023 Task 12, "Sentiment Analysis for Low-resource African Languages using Twitter Dataset". Sentiment analysis is one of the most widely studied applications in natural language processing. However, most prior work still focuses on a small number of high-resource languages. Building reliable sentiment analysis systems for low-resource languages remains challenging because training data for these languages is limited. In this work, we propose to leverage language-adaptive and task-adaptive pretraining on African texts and study transfer learning with source language selection on top of an African language-centric pretrained language model. Our key findings are: (1) Adapting the pretrained model to the target language and task using a small yet relevant corpus improves performance by more than 10 F1 points. (2) Selecting source languages with positive transfer gains during training can avoid harmful interference from dissimilar languages, leading to better results in multilingual and cross-lingual settings. In the shared task, our system wins 8 out of 15 tracks and, in particular, performs best in the multilingual evaluation.
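The idea of keeping only source languages with positive transfer gains can be illustrated with a minimal sketch. It assumes that a target-only baseline F1 and one F1 score per candidate source language (obtained by training with that source added) are already available. The function name `select_source_languages`, the `min_gain` threshold, and the placeholder language names and scores are hypothetical and are not taken from the paper.

```python
from typing import Dict, List


def select_source_languages(
    baseline_f1: float,
    f1_with_source: Dict[str, float],
    min_gain: float = 0.0,
) -> List[str]:
    """Return candidate source languages whose transfer gain exceeds min_gain.

    baseline_f1:     F1 on the target language without any extra source data.
    f1_with_source:  F1 on the target language when each candidate source
                     language is added to training (one score per candidate).
    The result is ordered from largest to smallest gain.
    """
    # Transfer gain = score with the source added minus the baseline score.
    gains = {lang: f1 - baseline_f1 for lang, f1 in f1_with_source.items()}
    # Keep only sources that help; sources with zero or negative gain are dropped.
    selected = [lang for lang, gain in gains.items() if gain > min_gain]
    return sorted(selected, key=lambda lang: gains[lang], reverse=True)


if __name__ == "__main__":
    # Illustrative placeholder numbers only, not results from the paper.
    candidates = {"src_a": 72.5, "src_b": 68.0, "src_c": 71.0}
    print(select_source_languages(70.0, candidates))
```

Raising `min_gain` above zero would make the selection more conservative, which may be useful when single-run scores are noisy.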
