Evaluation of Few-Shot Learning for Classification Tasks in the Polish Language

Tsimur Hadeliya, Dariusz Kajtoch

TL;DR

The paper benchmarks few-shot classification for native Polish across 7 datasets, evaluating fine-tuning, linear probing, SetFit, and in-context learning. It demonstrates that in-context learning with commercial LLMs yields the best performance in both zero- and few-shot settings, though a sizable gap remains compared to full-data fine-tuning of HerBERT-large. SetFit and linear probing provide robust, data-efficient alternatives, while non-linear fine-tuning proves unstable. The authors release 71 handcrafted ICL templates to support reproducibility and highlight the benefits of continual pre-training on Polish data for zero-shot performance, offering practical guidance for Polish NLP deployment under limited labeled data.

Abstract

We introduce a few-shot benchmark consisting of 7 different classification tasks native to the Polish language. We conducted an empirical comparison with 0 and 16 shots between fine-tuning, linear probing, SetFit, and in-context learning (ICL) using various pre-trained commercial and open-source models. Our findings reveal that ICL achieves the best performance, with commercial models such as GPT-3.5 and GPT-4 leading. However, a significant gap of 14 percentage points remains between our best few-shot learning score and the performance of HerBERT-large fine-tuned on the entire training dataset. Among the techniques, SetFit emerges as the second-best approach, closely followed by linear probing. We observed the worst and most unstable performance with non-linear head fine-tuning. Results for ICL indicate that continual pre-training of models like Mistral-7b or Llama-2-13b on Polish corpora is beneficial. This is confirmed by the improved performance of Bielik-7b and Trurl-13b, respectively. To further support experiments in few-shot learning for Polish, we are releasing handcrafted templates for ICL.

Paper Structure (47 sections, 6 figures, 9 tables)

Figures (6)

  • Figure 1: Summary of average performance on the few-shot classification benchmark for different training techniques with 16 shots. The last bar on the right reports results for HerBERT-large fine-tuned on the whole training dataset. We observe that ICL with GPT-3.5 achieves the best average performance, followed by SetFit, linear probing, and fine-tuning with a much smaller SBERT-large model. The gap between HerBERT-large and GPT-3.5 is around 20.4 percentage points.
  • Figure 2: Training and evaluation process of few-shot learning (4-shot shown as an example). Both methods begin by sampling $n$ examples (for $n$-shot learning) from the task's training data using a fixed random seed. We use 5 random seeds for reproducibility and to measure variance. (a) The sampled examples serve as in-context demonstrations. The evaluation prompt combines manually written instructions, demonstrations, and an optional system prompt. The model uses greedy decoding to generate a label for the test example. The generated label is compared with the gold labels by exact match. Predictions without a matching label receive a special label. (b) The sampled examples form the training dataset for the specified fine-tuning method (the Full Fine-tuning method uses the entire dataset). The fine-tuned model is then evaluated on the test dataset in the conventional manner.
  • Figure 3: Example building blocks of the prompt for in-context learning. Every prompt starts with the system instruction "Rozwiązujesz zadanie klasyfikacji dla języka polskiego." ("You are solving a classification task for the Polish language."). An instruction and a sequence of demonstrations follow. The blocks are joined with a newline character. The demonstration part specifies how examples are formatted: {text} and {labels} are substituted with the text and true labels from the dataset, respectively. A label mapping is used to map textual labels into numeric values.
  • Figure 4: Average performance of the linear probing method based on Ada and SBERT-large embeddings as a function of train dataset size.
  • Figure 5: Average performance (AVG) for Llama-2 family models. The upper figure shows zero-shot performance. The lower figure shows few-shot performance with 16 shots. The text models are trained with language modeling on a general corpus. The chat models are continually trained on instruction datasets.
  • ...and 1 more figure
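
The ICL evaluation loop described in the captions of Figures 2(a) and 3 can be sketched roughly as follows. This is a minimal illustration, not the authors' code: only the system prompt string, the {text} and {labels} placeholders, the newline joining, the fixed-seed sampling, and the exact-match rule come from the captions. The function names, the `UNMATCHED` marker, and the way the test example is appended with an empty label slot are assumptions made for illustration, and the call to the model itself is left out.

```python
import random

# System instruction quoted in the Figure 3 caption.
SYSTEM_PROMPT = "Rozwiązujesz zadanie klasyfikacji dla języka polskiego."
# Hypothetical name for the "special label" given to unmatched predictions.
UNMATCHED = "<no_match>"


def sample_shots(train, n, seed):
    """Draw n demonstrations from the training data using a fixed random seed."""
    rng = random.Random(seed)
    return rng.sample(train, n)


def build_prompt(instruction, demo_template, shots, text, label_map):
    """Join system prompt, instruction, demonstrations, and the query with newlines."""
    blocks = [SYSTEM_PROMPT, instruction]
    for shot_text, shot_label in shots:
        # Textual labels are mapped to their numeric form before substitution.
        blocks.append(demo_template.format(text=shot_text, labels=label_map[shot_label]))
    # Assumption: the test example reuses the template with the label left empty.
    blocks.append(demo_template.format(text=text, labels="").rstrip())
    return "\n".join(blocks)


def score_generation(generated, label_map):
    """Exact match against the mapped labels; anything else gets the special label."""
    inverse = {value: name for name, value in label_map.items()}
    return inverse.get(generated.strip(), UNMATCHED)
```

A usage pass would call `sample_shots` once per seed (the caption mentions 5 seeds), build one prompt per test example, send it to the model with greedy decoding, and feed the generated string to `score_generation`. The instruction text and demonstration template passed in would come from the released handcrafted templates; any concrete strings used when trying this sketch are placeholders.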