State of the Art in Text Classification for South Slavic Languages: Fine-Tuning or Prompting?

Taja Kuzman Pungeršek, Peter Rupnik, Ivan Porupski, Vuk Dinić, Nikola Ljubešić

TL;DR

The paper investigates whether zero-shot prompting of instruction-tuned large language models can match fine-tuned BERT-like models for text classification in South Slavic languages across sentiment, topic, and genre tasks. It conducts a comprehensive benchmark using four dataset families and multiple languages, contrasting open and closed LLMs with task-specific fine-tuned models and evaluating cross-language generalization against English. Results show LLMs often reach top performance in sentiment tasks, while fine-tuned models retain advantages in genre and topic classification; multilingual LLMs exhibit only modest drops relative to English, yet LLMs incur higher computational costs and occasional label hallucinations. The work highlights the practical trade-off between the immediacy of prompting, which needs no training data, and the reliability and speed of fine-tuned classifiers, and proposes an LLM teacher-student paradigm that uses LLMs for annotation while training domain-specific models. It provides a foundation for ongoing multilingual benchmarking and future work on few-shot prompting and hybrid approaches to large-scale text annotation.

Abstract

Until recently, fine-tuned BERT-like models provided state-of-the-art performance on text classification tasks. With the rise of instruction-tuned decoder-only models, commonly known as large language models (LLMs), the field has increasingly moved toward zero-shot and few-shot prompting. However, the performance of LLMs on text classification, particularly on less-resourced languages, remains under-explored. In this paper, we evaluate the performance of current language models on text classification tasks across several South Slavic languages. We compare openly available fine-tuned BERT-like models with a selection of open-source and closed-source LLMs across three tasks in three domains: sentiment classification in parliamentary speeches, topic classification in news articles and parliamentary speeches, and genre identification in web texts. Our results show that LLMs demonstrate strong zero-shot performance, often matching or surpassing fine-tuned BERT-like models. Moreover, when used in a zero-shot setup, LLMs perform comparably in South Slavic languages and English. However, we also point out key drawbacks of LLMs, including less predictable outputs, significantly slower inference, and higher computational costs. Due to these limitations, fine-tuned BERT-like models remain a more practical choice for large-scale automatic text annotation.

Paper Structure

This paper contains 22 sections, 4 figures, 3 tables.

Figures (4)

  • Figure 1: Micro-F1 and macro-F1 scores across models and languages on the test datasets for sentiment classification, automatic genre identification, and topic classification on news and parliamentary speeches.
  • Figure 2: Comparison of LLMs used in a zero-shot prompting fashion on sentiment identification, automatic genre identification, and topic classification on news and parliamentary speeches.
  • Figure 3: Comparison of models on the parliamentary topic classification based on their inference speed (seconds per instance) and performance (macro-F1 scores), both averaged across all four languages.
  • Figure 4: The prompts that are provided to the LLMs for the sentiment identification task, automatic genre identification, and topic classification on news and parliamentary speeches. The prompts comprise the description of the task and labels with a short description.
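
The Figure 4 caption describes each prompt as a task description followed by labels with short descriptions, and the TL;DR notes that LLMs occasionally hallucinate labels. The following is a minimal sketch of that prompt shape together with a guard against out-of-set answers. It is an illustration under stated assumptions, not the authors' code: the label names, the label descriptions, the prompt wording, and the helper names `build_prompt` and `parse_label` are all hypothetical.

```python
from typing import Dict, Optional

# Hypothetical label set with short descriptions. The paper's actual labels
# and prompt wording appear in its Figure 4 and are not reproduced here.
SENTIMENT_LABELS: Dict[str, str] = {
    "negative": "the text expresses a negative attitude",
    "neutral": "the text expresses no clear attitude",
    "positive": "the text expresses a positive attitude",
}


def build_prompt(task_description: str, labels: Dict[str, str], text: str) -> str:
    """Assemble a zero-shot prompt: task description, labels with descriptions, then the text."""
    label_lines = "\n".join(f"- {name}: {desc}" for name, desc in labels.items())
    return (
        f"{task_description}\n\n"
        f"Labels:\n{label_lines}\n\n"
        f"Text: {text}\n\n"
        "Answer with exactly one label."
    )


def parse_label(raw_output: str, labels: Dict[str, str]) -> Optional[str]:
    """Map a raw model reply to a known label, or None if it is outside the label set."""
    # Trim whitespace, trailing punctuation, and quotes before comparing.
    cleaned = raw_output.strip().strip(".\"'").lower()
    for name in labels:
        if cleaned == name.lower():
            return name
    return None  # treat as a hallucinated or malformed label


prompt = build_prompt(
    "Classify the sentiment of the following parliamentary speech.",
    SENTIMENT_LABELS,
    "Example speech text.",
)
```

Returning `None` for an unrecognized reply keeps hallucinated labels visible to the caller, who can then decide whether to retry, count the instance as an error, or fall back to a default class.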