LLM Teacher-Student Framework for Text Classification With No Manually Annotated Data: A Case Study in IPTC News Topic Classification
Taja Kuzman, Nikola Ljubešić
TL;DR
This work tackles multilingual IPTC news topic classification without manual labeling: a GPT-based teacher automatically annotates a corpus of 20,000 news articles, and a smaller XLM-RoBERTa student is then fine-tuned on the GPT-labeled data. The authors release the EMMediaTopic dataset (Catalan, Croatian, Greek, and Slovenian) together with the best-performing multilingual classifier, giving open access to a practical IPTC topic classifier that, through its XLM-RoBERTa backbone, can be applied to the 100 languages that model covers. The teacher's agreement with human annotators is comparable to the agreement among the annotators themselves, the student is data-efficient (near-teacher results at 15k labeled examples), and it shows robust zero-shot cross-lingual transfer. These findings support scalable multilingual news topic classification at reduced annotation cost. Open limitations and avenues for future work include multi-label classification and classification at deeper levels of the IPTC hierarchy.
Abstract
With the ever-increasing number of news stories available online, classifying them by topic, regardless of the language they are written in, has become crucial for enhancing readers' access to relevant content. To address this challenge, we propose a teacher-student framework based on large language models (LLMs) for developing multilingual news topic classification models of reasonable size with no need for manual data annotation. The framework employs a Generative Pretrained Transformer (GPT) model as the teacher model to develop a news topic training dataset through automatic annotation of 20,000 news articles in Slovenian, Croatian, Greek, and Catalan. Articles are classified into 17 main categories from the Media Topic schema, developed by the International Press Telecommunications Council (IPTC). The teacher model exhibits high zero-shot performance in all four languages. Its agreement with human annotators is comparable to that between the human annotators themselves. To mitigate the computational limitations associated with the requirement of processing millions of texts daily, smaller BERT-like student models are fine-tuned on the GPT-annotated dataset. These student models achieve high performance comparable to the teacher model. Furthermore, we explore the impact of the training data size on the performance of the student models and investigate their monolingual, multilingual, and zero-shot cross-lingual capabilities. The findings indicate that student models can achieve high performance with a relatively small number of training instances, and demonstrate strong zero-shot cross-lingual abilities. Finally, we publish the best-performing news topic classifier, enabling multilingual classification with the top-level categories of the IPTC Media Topic schema.
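
The annotation step of the teacher-student framework can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the keyword-based `toy_teacher` (a stand-in for a real GPT call), the example articles, and the six-label subset of the 17 top-level categories are all assumptions made for illustration. The sketch only shows the general idea of collecting teacher labels, discarding replies that do not map to a valid category, and keeping the remaining text-label pairs as training data for a student model.

```python
from typing import Callable, Iterable, List, Optional, Tuple

# Illustrative subset only (assumption): the schema used in the paper has 17 categories.
LABELS = ["education", "environment", "health", "politics", "sport", "weather"]


def normalise(raw: str, labels: List[str] = LABELS) -> Optional[str]:
    """Map a free-text teacher reply to a known label, or None if it does not match."""
    cleaned = raw.strip().strip(".\"'").lower()
    return cleaned if cleaned in labels else None


def annotate(texts: Iterable[str], teacher: Callable[[str], str]) -> List[Tuple[str, str]]:
    """Label each text with the teacher; keep only replies that map to a valid label."""
    dataset = []
    for text in texts:
        label = normalise(teacher(text))
        if label is not None:
            dataset.append((text, label))
    return dataset


def toy_teacher(text: str) -> str:
    """Hypothetical stand-in for an LLM call: a keyword rule that keeps the sketch runnable."""
    lowered = text.lower()
    if "match" in lowered:
        return "Sport."
    if "election" in lowered:
        return "politics"
    return "I am not sure"


# Made-up example articles, not taken from the dataset.
articles = [
    "The home team won the match 2-1.",
    "Voters head to the polls in Sunday's election.",
    "A short note with no clear topic.",
]
train_set = annotate(articles, toy_teacher)
```

In a real pipeline, `toy_teacher` would be replaced by a prompt to the teacher LLM, and `train_set` would be used to fine-tune a student such as XLM-RoBERTa. The fine-tuning step is omitted here.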
