State of the Art in Text Classification for South Slavic Languages: Fine-Tuning or Prompting?

Taja Kuzman Pungeršek; Peter Rupnik; Ivan Porupski; Vuk Dinić; Nikola Ljubešić

TL;DR

The paper investigates whether zero-shot prompting of instruction-tuned large language models can match fine-tuned BERT-like models for text classification in South Slavic languages across sentiment, topic, and genre tasks. It conducts a comprehensive benchmark using four dataset families and multiple languages, contrasting open and closed LLMs with task-specific fine-tuned models and comparing performance in South Slavic languages against English. Results show that LLMs often reach top performance in sentiment tasks, while fine-tuned models retain advantages in genre and topic classification. LLMs exhibit only modest drops in South Slavic languages relative to English, yet they incur higher computational costs and occasionally hallucinate labels. The work highlights the practical trade-off between the convenience of needing no training data and the reliability and speed of fine-tuned classifiers, and proposes an LLM teacher-student paradigm in which LLMs annotate data that is then used to train domain-specific models. It provides a foundation for ongoing multilingual benchmarking and future work on few-shot prompting and hybrid approaches to large-scale text annotation.

Abstract

Until recently, fine-tuned BERT-like models provided state-of-the-art performance on text classification tasks. With the rise of instruction-tuned decoder-only models, commonly known as large language models (LLMs), the field has increasingly moved toward zero-shot and few-shot prompting. However, the performance of LLMs on text classification, particularly on less-resourced languages, remains under-explored. In this paper, we evaluate the performance of current language models on text classification tasks across several South Slavic languages. We compare openly available fine-tuned BERT-like models with a selection of open-source and closed-source LLMs across three tasks in three domains: sentiment classification in parliamentary speeches, topic classification in news articles and parliamentary speeches, and genre identification in web texts. Our results show that LLMs demonstrate strong zero-shot performance, often matching or surpassing fine-tuned BERT-like models. Moreover, when used in a zero-shot setup, LLMs perform comparably in South Slavic languages and English. However, we also point out key drawbacks of LLMs, including less predictable outputs, significantly slower inference, and higher computational costs. Due to these limitations, fine-tuned BERT-like models remain a more practical choice for large-scale automatic text annotation.
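The "less predictable outputs" drawback can be made concrete with a minimal sketch of how free-form replies from a zero-shot LLM might be mapped onto a closed label set, with everything else counted as invalid. This is an illustration under stated assumptions, not the authors' evaluation code: the label set `LABELS` and the helper names `normalize_label` and `invalid_rate` are hypothetical, and the paper's actual label inventories and post-processing are not described in this section.

```python
import re

# Hypothetical label set for illustration only; the paper's actual
# label inventories are not listed in this section.
LABELS = ("negative", "neutral", "positive")


def normalize_label(raw_output, labels=LABELS):
    """Map a free-form model reply to one allowed label, or None."""
    text = raw_output.strip().lower()
    # Collect the allowed labels that occur as whole words in the reply.
    hits = [lab for lab in labels if re.search(rf"\b{re.escape(lab)}\b", text)]
    # Accept only unambiguous replies; zero or several matches are
    # treated as an invalid (possibly hallucinated) label.
    return hits[0] if len(hits) == 1 else None


def invalid_rate(raw_outputs, labels=LABELS):
    """Share of replies that could not be mapped to an allowed label."""
    if not raw_outputs:
        return 0.0
    misses = sum(normalize_label(r, labels) is None for r in raw_outputs)
    return misses / len(raw_outputs)
```

A fine-tuned classifier with a fixed output layer cannot produce labels outside its inventory, so a check of this kind is only needed on the prompting side. Tracking the invalid rate alongside accuracy is one way to quantify how unpredictable a given model's outputs are.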
