Comparing Specialised Small and General Large Language Models on Text Classification: 100 Labelled Samples to Achieve Break-Even Performance
Branislav Pecher, Ivan Srba, Maria Bielikova
TL;DR
This study compares data-efficient learning strategies for text classification, pitting specialised small LMs (fine-tuned or instruction-tuned) against general large LMs (used via prompting or in-context learning). Using eight LM configurations and eight datasets with varied characteristics, it defines break-even points as the numbers of labelled samples at which the specialised approach overtakes the general model, first on average and then when performance variance is taken into account. Key results show that specialised small models typically reach parity with around 100 labelled samples, though accounting for performance variance increases the data requirement by 100–200% on average, and in some cases the second, variance-aware break-even point is not reached even with the full dataset. 4-bit quantisation yields substantial compute savings with negligible impact on performance variance, and larger models do not consistently yield better data efficiency. The authors provide practical recommendations for choosing between prompting, fine-tuning, and instruction-tuning, and for budgeting annotation and compute resources under real-world constraints.
Abstract
When solving NLP tasks with limited labelled data, researchers typically either use a general large language model without further updates, or use a small number of labelled samples to tune a specialised smaller model. In this work, we answer an important question -- how many labelled samples are required for the specialised small models to outperform general large models, while taking the performance variance into consideration. By observing the behaviour of fine-tuning, instruction-tuning, prompting and in-context learning on 8 language models, we identify such performance break-even points across 8 representative text classification tasks of varying characteristics. We show that the specialised models often need only a few samples (on average $100$) to be on par with or better than the general ones. At the same time, the number of required labels strongly depends on the dataset or task characteristics, with fine-tuning on binary datasets requiring significantly more samples. When performance variance is taken into consideration, the number of required labels increases on average by $100$--$200\%$. Finally, larger models do not consistently lead to better performance and lower variance, with 4-bit quantisation having negligible impact.
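To make the notion of a break-even point concrete, the following is a minimal sketch, not the authors' actual procedure. It assumes a break-even point can be read off as the smallest training-set size at which the specialised model's score (optionally lowered by `k` standard deviations) reaches the general model's score (optionally raised by `k` standard deviations). The function name `break_even`, the parameter `k`, and all scores below are hypothetical and synthetic; they are not results from the paper.

```python
from statistics import mean, stdev


def break_even(sizes, specialised_scores, general_scores, k=0.0):
    """Return the smallest training-set size at which the specialised model's
    lower bound (mean - k*std over runs) reaches the general model's upper
    bound (mean + k*std over runs), or None if that never happens.

    k=0 compares means only; k=1 is one assumed way to account for variance.
    """
    general_bound = mean(general_scores) + k * stdev(general_scores)
    for size, runs in zip(sizes, specialised_scores):
        if mean(runs) - k * stdev(runs) >= general_bound:
            return size
    return None


# Synthetic, illustrative scores (e.g. F1 over three random seeds per size).
sizes = [10, 50, 100, 200, 400]
specialised = [
    [0.50, 0.60, 0.40],
    [0.70, 0.78, 0.62],
    [0.80, 0.86, 0.74],
    [0.84, 0.88, 0.80],
    [0.88, 0.90, 0.86],
]
# General model evaluated without tuning, over three prompt/seed variations.
general = [0.77, 0.79, 0.75]

mean_break_even = break_even(sizes, specialised, general, k=0.0)
variance_break_even = break_even(sizes, specialised, general, k=1.0)
print(mean_break_even, variance_break_even)
```

With these made-up numbers, the variance-aware threshold is crossed at a larger training-set size than the mean-only threshold, which illustrates why taking variance into account raises the number of labels required. The sketch returns the first crossing only; a stricter variant could require the condition to hold for all larger sizes as well.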
