AlbNews: A Corpus of Headlines for Topic Modeling in Albanian
Erion Çano, Dario Lamaj
TL;DR
This paper addresses the scarcity of Albanian NLP resources by introducing AlbNews, a corpus of 600 labeled and 2600 unlabeled Albanian news headlines, with each labeled headline assigned to one of four topics: politics, culture, economy, or sport. The dataset supports both topic modeling and text classification research, with headlines collected from 2022–2023 and labeled under a straightforward annotation scheme. Baseline experiments using TF-IDF features and traditional classifiers show that simple models such as Logistic Regression and SVM achieve the best accuracy, while ensemble methods underperform, likely because of the limited data and potential overfitting. This work provides a practical baseline and a resource to spur Albanian NLP research; future work includes hyperparameter tuning and the development of LLMs pretrained on Albanian text to improve performance on downstream tasks.
Abstract
The scarcity of available text corpora for low-resource languages like Albanian is a serious hurdle for research in natural language processing tasks. This paper introduces AlbNews, a collection of 600 topically labeled news headlines and 2600 unlabeled ones in Albanian. The data can be freely used for conducting topic modeling research. We report the initial classification scores of some traditional machine learning classifiers trained with the AlbNews samples. These results show that basic models outperform the ensemble learning ones, and they can serve as a baseline for future experiments.
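To make the baseline setup concrete, the following is a minimal sketch of headline classification over TF-IDF features, written with the Python standard library only. It is not the experimental code behind the reported scores: the headlines are made-up English placeholders rather than AlbNews samples, the helper names (tokenize, fit_idf, tfidf, fit_centroids, predict) are hypothetical, and a nearest-centroid rule with cosine similarity stands in for the traditional classifiers evaluated on the corpus.

```python
import math
from collections import Counter, defaultdict

# Made-up placeholder headlines (NOT taken from AlbNews), one topic label each.
train = [
    ("parliament votes on new election law", "politics"),
    ("prime minister meets opposition leaders", "politics"),
    ("national theatre opens film festival", "culture"),
    ("museum exhibition shows folk music history", "culture"),
    ("central bank raises interest rates", "economy"),
    ("inflation slows as exports grow", "economy"),
    ("national team wins football match", "sport"),
    ("striker scores twice in league final", "sport"),
]


def tokenize(text):
    """Lowercase and split on whitespace (an assumed, deliberately naive tokenizer)."""
    return text.lower().split()


def fit_idf(docs):
    """Compute a smoothed inverse document frequency for every training term."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    return {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}


def tfidf(doc, idf):
    """Return an L2-normalized sparse TF-IDF vector; unseen terms are dropped."""
    tf = Counter(t for t in doc if t in idf)
    vec = {t: c * idf[t] for t, c in tf.items()}
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm else {}


def fit_centroids(samples, idf):
    """Sum the TF-IDF vectors of each topic and normalize the result."""
    sums = defaultdict(lambda: defaultdict(float))
    for text, label in samples:
        for t, w in tfidf(tokenize(text), idf).items():
            sums[label][t] += w
    centroids = {}
    for label, vec in sums.items():
        norm = math.sqrt(sum(w * w for w in vec.values()))
        centroids[label] = {t: w / norm for t, w in vec.items()}
    return centroids


def predict(text, idf, centroids):
    """Pick the topic whose centroid has the highest cosine similarity."""
    vec = tfidf(tokenize(text), idf)

    def cosine(label):
        return sum(w * centroids[label].get(t, 0.0) for t, w in vec.items())

    return max(centroids, key=cosine)


idf = fit_idf([tokenize(text) for text, _ in train])
centroids = fit_centroids(train, idf)
print(predict("bank cuts interest rates", idf, centroids))  # prints "economy"
```

With only a few hundred labeled headlines, sparse features paired with a simple decision rule of this kind leave little room to overfit, which is consistent with the observation that the basic models score better than the ensemble ones.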
