Fine-Tuned 'Small' LLMs (Still) Significantly Outperform Zero-Shot Generative AI Models in Text Classification
Martin Juan José Bucher, Marco Martini
TL;DR
The paper asks whether zero-shot generative AI can replace fine-tuned small LLMs for text classification. It runs a systematic, cross-task comparison between fine-tuned RoBERTa-family models and zero-shot use of GPT-3.5/4, Claude Opus, and BART across sentiment, stance, and emotion tasks in English and German, with an ablation over training set size. The main finding is that fine-tuned small LLMs consistently outperform zero-shot large models, especially on non-standard tasks, and that meaningful performance is reachable with roughly 200-500 labeled examples. An easy-to-use Hugging Face-based toolkit accompanies the paper. It lets non-experts fine-tune LLMs for classification tasks and highlights the practical advantages of smaller models for privacy and control in production settings.
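To make the training-size ablation concrete, the following is a minimal sketch of how one might draw label-balanced training subsets of different sizes before fine-tuning. It is not taken from the paper's toolkit: the function names `stratified_subsample` and `training_size_grid`, the default sizes, and the proportional sampling rule are all assumptions chosen for illustration.

```python
import random
from collections import defaultdict


def stratified_subsample(examples, n, seed=0):
    """Draw roughly n (text, label) pairs, keeping label proportions close to the full set.

    Hypothetical helper for illustration; not part of the paper's toolkit.
    """
    total = len(examples)
    if n >= total:
        # Nothing to subsample: return a copy of the full set.
        return list(examples)

    rng = random.Random(seed)
    by_label = defaultdict(list)
    for text, label in examples:
        by_label[label].append((text, label))

    sample = []
    for label in sorted(by_label):
        items = by_label[label]
        # Proportional share, with at least one example so no class vanishes.
        k = max(1, round(n * len(items) / total))
        sample.extend(rng.sample(items, min(k, len(items))))

    rng.shuffle(sample)
    return sample


def training_size_grid(examples, sizes=(200, 500, 1000), seed=0):
    """Build one subsample per training size (sizes are illustrative assumptions)."""
    return {n: stratified_subsample(examples, n, seed) for n in sizes}
```

Each subset would then be used to fine-tune a separate model that is scored on the same held-out test set, so that performance can be compared across training sizes. Because of rounding, the returned subset can differ from `n` by a few examples when class shares are not whole numbers.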
Abstract
Generative AI offers a simple, prompt-based alternative to fine-tuning smaller BERT-style LLMs for text classification tasks. This promises to eliminate the need for manually labeled training data and task-specific model training. However, it remains an open question whether tools like ChatGPT can deliver on this promise. In this paper, we show that smaller, fine-tuned LLMs (still) consistently and significantly outperform larger, zero-shot prompted models in text classification. We compare three major generative AI models (ChatGPT with GPT-3.5/GPT-4 and Claude Opus) with several fine-tuned LLMs across a diverse set of classification tasks (sentiment, approval/disapproval, emotions, party positions) and text categories (news, tweets, speeches). We find that fine-tuning with application-specific training data achieves superior performance in all cases. To make this approach more accessible to a broader audience, we provide an easy-to-use toolkit alongside this paper. Our toolkit, accompanied by non-technical step-by-step guidance, enables users to select and fine-tune BERT-like LLMs for any classification task with minimal technical and computational effort.
