Comparing Specialised Small and General Large Language Models on Text Classification: 100 Labelled Samples to Achieve Break-Even Performance

Branislav Pecher, Ivan Srba, Maria Bielikova

TL;DR

This study compares data-efficient learning strategies for text classification by pitting specialised small LMs (via fine-tuning or instruction-tuning) against general large LMs (via prompting or in-context learning). Using eight language models and eight datasets with varied characteristics, it identifies break-even points: the numbers of labelled samples at which the specialised approach overtakes the general model, first on average performance and then also when performance variance is taken into account. Key results show that specialised small models typically reach parity with around 100 labelled samples, though accounting for performance variance raises the data requirement by 100–200% on average, and in some cases the second (variance-aware) break-even point is not reached even with the full dataset. 4-bit quantisation yields substantial compute savings with negligible impact on performance variance, and larger models do not consistently yield better data-efficiency. The authors provide practical recommendations for choosing between prompting, fine-tuning, and instruction-tuning, and for budgeting annotation and compute resources under real-world constraints.

Abstract

When solving NLP tasks with limited labelled data, researchers typically either use a general large language model without further updates, or use a small number of labelled samples to tune a specialised smaller model. In this work, we answer an important question: how many labelled samples are required for the specialised small models to outperform general large models, while taking the performance variance into consideration? By observing the behaviour of fine-tuning, instruction-tuning, prompting and in-context learning on 8 language models, we identify such performance break-even points across 8 representative text classification tasks of varying characteristics. We show that the specialised models often need only a few samples (on average $100$) to be on par with or better than the general ones. At the same time, the number of required labels strongly depends on the dataset or task characteristics, with fine-tuning on binary datasets requiring significantly more samples. When performance variance is taken into consideration, the number of required labels increases on average by $100$–$200\%$. Finally, larger models do not consistently lead to better performance and lower variance, with 4-bit quantisation having negligible impact.
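
As a rough illustration of the break-even idea described in the abstract, the sketch below locates two thresholds on a learning curve: the first sample size at which the specialised model's mean score reaches the general model's mean, and the first size at which it still does so after a variance margin is applied. The function names, the one-standard-deviation margin, and all numbers are assumptions made for illustration; they are not the paper's exact procedure or results.

```python
from typing import Optional, Sequence, Tuple


def break_even_points(
    sample_sizes: Sequence[int],
    specialised_mean: Sequence[float],
    specialised_std: Sequence[float],
    general_mean: float,
    general_std: float,
) -> Tuple[Optional[int], Optional[int]]:
    """Return (first, second) break-even sample sizes, or None where never reached.

    first:  smallest size where the specialised mean >= the general mean.
    second: smallest size where (specialised mean - std) >= (general mean + std).
    The one-std margin is an assumed criterion, chosen only for this sketch.
    """
    first: Optional[int] = None
    second: Optional[int] = None
    for n, mean, std in zip(sample_sizes, specialised_mean, specialised_std):
        if first is None and mean >= general_mean:
            first = n
        if second is None and mean - std >= general_mean + general_std:
            second = n
    return first, second


def relative_increase(first: int, second: int) -> float:
    """Percentage increase in labelled samples from the first to the second point."""
    return (second - first) / first * 100


# Hypothetical F1-macro learning curve for a specialised model (made-up values).
sizes = [10, 50, 100, 200, 300, 500]
spec_mean = [0.40, 0.62, 0.75, 0.79, 0.83, 0.86]
spec_std = [0.10, 0.08, 0.06, 0.05, 0.03, 0.02]

# Hypothetical general model used via prompting; it does not depend on sample size.
first, second = break_even_points(sizes, spec_mean, spec_std, 0.72, 0.04)
print(first, second)
if first is not None and second is not None:
    print(relative_increase(first, second))
```

With these made-up numbers, the variance-aware threshold lands at three times the mean-only threshold, which is the kind of gap the 100–200% figure refers to. If the margin criterion is never met on the available data, the second value stays `None`, mirroring the case where the break-even point is not reached within the full dataset.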

Paper Structure (23 sections, 7 figures, 3 tables)

This paper contains 23 sections, 7 figures, 3 tables.

Figures (7)

  • Figure 1: Comparison between the performance of specialised small and general large language models. The break-even points are identified by varying the number of available labelled samples and taking performance variance into consideration. Specialised models outperform general ones with only a few labelled samples (up to $100$), and performance variance has a strong impact on the comparison, increasing this number significantly.
  • Figure 2: The impact of varying the number of available labelled samples (on a logarithmic scale) on the performance of fine-tuning, prompting, in-context learning and instruction-tuning approaches, reported using F1 macro and its standard deviation. For each approach, we select only the best-performing model. Specialised models can often outperform general models with only a relatively small number of labelled samples ($10$–$1000$).
  • Figure 3: A showcase of the dataset dependence of the break-even points for specific models. Models that perform well on one dataset may perform significantly worse on others because of differing dataset characteristics, such as the number of classes, sentence length, task type, or whether the dataset was used as part of the model's pre-training.
  • Figure 4: Comparison between the 4-bit quantised and non-quantised Mistral-7B and Zephyr-7B models used for in-context learning across all datasets. The impact of quantisation is not consistent across datasets. The quantised models often achieve better performance than the non-quantised ones, although the difference is usually small. In addition, the impact on the variance is negligible.
  • Figure 5: The impact of varying the number of available labelled training samples (on a logarithmic scale) on the performance of fine-tuning, prompting, in-context learning and instruction-tuning approaches across the binary (SST2, MRPC, CoLA, BoolQ) and multi-class datasets (AG News, TREC, SNIPS, DB Pedia), reported using F1 macro and its standard deviation. For each approach, we group the models by size into small (Flan-T5), medium (Mistral/Zephyr) and large (LLaMA/GPT) and report the best-performing model for each group. Even though the effect of model size is often significant, it does not follow the common assumption that larger models perform better. The smaller models, especially in prompting, in-context learning and instruction-tuning, often achieve better performance than the medium or large general models.
  • ...and 2 more figures