Sentiment Analysis in SemEval: A Review of Sentiment Identification Approaches
Bousselham El Haddaoui, Raddouane Chiheb, Rdouan Faizi, Abdellatif El Afia
TL;DR
The paper surveys top-performing SemEval sentiment-analysis systems from 2013 to 2021, tracing how data collection, preprocessing, representations, and classifiers have evolved in response to social media text. It documents a progression from lexicon-based and traditional ML approaches toward dense embeddings and especially transformer-based pre-trained language models (PLMs), with ensemble methods and domain-specific adaptations playing notable roles. Datasets are predominantly Twitter-derived and have grown in size, with occasional large-scale semi-supervised corpora such as the SOLID dataset, while annotation strategies have shifted from manual labeling to crowdsourcing and semi-supervised enrichment. The findings underscore the continued importance of preprocessing, the rise of PLMs for robust performance, and persistent multilingual and domain-specific challenges, offering guidance for rapid prototyping and future SemEval editions.
Abstract
Social media platforms are becoming the foundation of social interaction, including messaging and opinion expression. In this regard, sentiment analysis techniques aim to retrieve and analyze the generated data, including sentiments, emotions, and discussed topics. International competitions such as the International Workshop on Semantic Evaluation (SemEval) have attracted many researchers and practitioners with a special interest in building sentiment analysis systems. In our work, we study the top-ranking systems of each SemEval edition during the 2013-2021 period; a total of 658 teams participated in these editions, with interest increasing over the years. We analyze the proposed systems, marking the evolution of research trends, with a focus on the main components of sentiment analysis systems, including data acquisition, preprocessing, and classification. Our study shows an active use of preprocessing techniques, an evolution of feature engineering and word representation from lexicon-based approaches to word embeddings, and the dominance of neural networks and transformers in the classification phase, which fosters the use of ready-to-use models. Moreover, we provide researchers with insights drawn from the systems studied, which will allow rapid prototyping of new systems and help practitioners build systems for future SemEval editions.
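To make the preprocessing and classification components named above more concrete, the following is a minimal Python sketch of tweet normalization followed by a lexicon-based scorer, the earliest family of approaches in the progression described. It is an illustration only and does not reproduce any system from the surveyed editions: the placeholder tokens `<url>` and `<user>`, the function names, and the six-word `TOY_LEXICON` are all assumptions introduced here.

```python
import re

# Patterns for noise that is typical of tweets (illustrative choices).
URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_RE = re.compile(r"@\w+")
HASHTAG_RE = re.compile(r"#(\w+)")
ELONGATED_RE = re.compile(r"(.)\1{2,}")  # three or more repeats of one character

# Hypothetical toy lexicon; real systems rely on much larger resources.
TOY_LEXICON = {"love": 1, "great": 1, "good": 1, "hate": -1, "awful": -1, "bad": -1}


def preprocess(tweet: str) -> list[str]:
    """Lowercase, mask URLs and mentions, unwrap hashtags, shorten elongations."""
    text = tweet.lower()
    text = URL_RE.sub(" <url> ", text)
    text = MENTION_RE.sub(" <user> ", text)
    text = HASHTAG_RE.sub(r" \1 ", text)      # "#great" becomes "great"
    text = ELONGATED_RE.sub(r"\1\1", text)    # "loooove" becomes "loove"
    return re.findall(r"<\w+>|\w+", text)


def lexicon_score(tokens: list[str]) -> int:
    """Sum lexicon polarities; retry unknown tokens with doubled letters squeezed."""
    score = 0
    for tok in tokens:
        if tok in TOY_LEXICON:
            score += TOY_LEXICON[tok]
        else:
            squeezed = re.sub(r"(.)\1", r"\1", tok)  # "loove" becomes "love"
            score += TOY_LEXICON.get(squeezed, 0)
    return score


def classify(tweet: str) -> str:
    """Map the summed polarity to a three-way sentiment label."""
    score = lexicon_score(preprocess(tweet))
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


if __name__ == "__main__":
    print(classify("I loooove this!!! @bob http://t.co/x #great"))
```

In an embedding-based or transformer-based pipeline, the lexicon scorer would be replaced by a learned model, while a normalization step similar to `preprocess` could plausibly stay in place.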
