Evaluation of Few-Shot Learning for Classification Tasks in the Polish Language

Tsimur Hadeliya; Dariusz Kajtoch

TL;DR

The paper benchmarks few-shot classification for native Polish across 7 datasets, evaluating fine-tuning, linear probing, SetFit, and in-context learning. It demonstrates that in-context learning with commercial LLMs yields the best performance in both zero- and few-shot settings, though a sizable gap remains compared to full-data fine-tuning of HerBERT-large. SetFit and linear probing provide robust, data-efficient alternatives, while non-linear fine-tuning proves unstable. The authors release 71 handcrafted ICL templates to support reproducibility and highlight the benefits of continual pre-training on Polish data for zero-shot performance, offering practical guidance for Polish NLP deployment under limited labeled data.

Abstract

We introduce a few-shot benchmark consisting of 7 different classification tasks native to the Polish language. We conducted an empirical comparison with 0 and 16 shots between fine-tuning, linear probing, SetFit, and in-context learning (ICL) using various pre-trained commercial and open-source models. Our findings reveal that ICL achieves the best performance, with commercial models such as GPT-3.5 and GPT-4 obtaining the highest scores. However, there remains a significant gap of 14 percentage points between our best few-shot learning score and the performance of HerBERT-large fine-tuned on the entire training dataset. Among the techniques, SetFit emerges as the second-best approach, closely followed by linear probing. We observed the worst and most unstable performance with non-linear head fine-tuning. Results for ICL indicate that continual pre-training of models like Mistral-7b or Llama-2-13b on Polish corpora is beneficial. This is confirmed by the improved performance of Bielik-7b and Trurl-13b, respectively. To further support experiments in few-shot learning for Polish, we are releasing handcrafted templates for ICL.

Paper Structure (47 sections, 6 figures, 9 tables)

Introduction
Related Work
Few-shot benchmarks
Few-shot learning
Large Language Models (LLMs)
Multilingual capabilities of LLMs
Problem Statement
Methodology
Datasets
Evaluation metrics
Experimental setup
Models
Training schemes
Baseline
Linear probing
...and 32 more sections

Figures (6)

Figure 1: Summary of average performance on the few-shot classification benchmark for different training techniques with 16 shots. The last bar on the right reports results for HerBERT-large fine-tuned on the whole training dataset. We observe that ICL with GPT-3.5 achieves the best average performance, followed by SetFit, linear probing, and fine-tuning with a much smaller SBERT-large model. The gap between HerBERT-large and GPT-3.5 is around 20.4 percentage points.
Figure 2: Training and evaluation process of few-shot learning (4-shot shown as an example). Both methods begin by sampling $n$ examples (for $n$-shot learning) from the task's training data using a fixed random seed. We use 5 random seeds for reproducibility and to measure variance. (a) The sampled examples serve as in-context demonstrations. The evaluation prompt combines manually written instructions, demonstrations, and an optional system prompt. The model uses greedy decoding to generate a label for the test example. The generated label is compared with the gold labels by exact match. Predictions without a matching label receive a special label. (b) The sampled examples form the training dataset for the specified fine-tuning method (the Full Fine-tuning method uses the entire dataset). The fine-tuned model is then evaluated on the test dataset in the conventional manner.
Figure 3: Example building blocks of the prompt for in-context learning. Every prompt starts with the system instruction "Rozwiązujesz zadanie klasyfikacji dla języka polskiego." ("You are solving a classification task for the Polish language."). An instruction and a sequence of demonstrations follow. The blocks are joined with a newline character. The demonstration part specifies how examples are formatted: {text} and {labels} are substituted with the text and true labels from the dataset, respectively. A label mapping converts textual labels into numeric values.
Figure 4: Average performance of the linear probing method based on Ada and SBERT-large embeddings as a function of train dataset size.
Figure 5: Average performance (AVG) for Llama-2 family models. The upper figure shows zero-shot performance. The lower figure shows few-shot performance with 16 shots. The text models are trained with language modeling on a general corpus. The chat models are additionally trained on instruction datasets.
...and 1 more figures
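
The ICL pipeline described in the captions of Figures 2 and 3 can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' code: the function names, the demonstration template, the label mapping, and the name of the fallback label are all hypothetical, and the captions do not say whether $n$ counts examples per class or in total (the sketch samples $n$ in total). Only the system instruction string and the {text}/{labels} placeholders come from the captions. The model call itself is left out; `exact_match_label` takes whatever text the model generated.

```python
import random

# System instruction quoted in the Figure 3 caption.
SYSTEM_PROMPT = "Rozwiązujesz zadanie klasyfikacji dla języka polskiego."
# Hypothetical name for the "special label" given to unmatched predictions.
FALLBACK_LABEL = "<no_match>"


def sample_shots(train, n, seed):
    """Draw n demonstrations with a fixed seed so the draw is repeatable."""
    rng = random.Random(seed)
    return rng.sample(train, n)


def build_prompt(instruction, demo_template, shots, query_text, label_map):
    """Join system prompt, instruction, demonstrations, and query with newlines."""
    blocks = [SYSTEM_PROMPT, instruction]
    for text, label in shots:
        blocks.append(demo_template.format(text=text, labels=label_map[label]))
    # Assumption: the query reuses the demonstration format with an empty label.
    blocks.append(demo_template.format(text=query_text, labels="").rstrip())
    return "\n".join(blocks)


def exact_match_label(generated, label_map):
    """Map generated text back to a dataset label, or to the fallback label."""
    inverse = {value: key for key, value in label_map.items()}
    return inverse.get(generated.strip(), FALLBACK_LABEL)


def accuracy(predictions, gold):
    """Share of predictions that equal the gold label."""
    return sum(p == g for p, g in zip(predictions, gold)) / len(gold)


if __name__ == "__main__":
    # Toy data and template, invented for illustration only.
    train = [("Świetny film", "positive"), ("Nudna książka", "negative"),
             ("Polecam", "positive"), ("Strata czasu", "negative")]
    label_map = {"positive": "1", "negative": "0"}
    template = "Tekst: {text}\nEtykieta: {labels}"
    for seed in range(5):  # five seeds, as in the Figure 2 caption
        shots = sample_shots(train, n=2, seed=seed)
        prompt = build_prompt("Oceń wydźwięk tekstu.", template, shots,
                              "Bardzo dobra obsługa", label_map)
        print(prompt, end="\n\n")
```

Repeating the draw over several seeds, as in the loop above, is what allows the variance across demonstration sets to be reported alongside the mean score.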
