Fine-Tuned 'Small' LLMs (Still) Significantly Outperform Zero-Shot Generative AI Models in Text Classification

Martin Juan José Bucher; Marco Martini

TL;DR

The paper asks whether zero-shot generative AI can replace fine-tuning of small LLMs for text classification. It runs a systematic, cross-task comparison of fine-tuned RoBERTa-family models against zero-shot prompting of GPT-3.5/4, Claude Opus, and BART across sentiment, stance, and emotion tasks in English and German, with an ablation on training data size. The main finding is that fine-tuned small LLMs consistently outperform zero-shot large models, especially on non-standard tasks, and that meaningful performance can be achieved with roughly 200-500 labeled examples. An easy-to-use Hugging Face–based toolkit accompanies the paper. It lets non-experts fine-tune LLMs for classification tasks and highlights the practical advantages of smaller models for privacy and control in production settings.

Abstract

Generative AI offers a simple, prompt-based alternative to fine-tuning smaller BERT-style LLMs for text classification tasks. This promises to eliminate the need for manually labeled training data and task-specific model training. However, it remains an open question whether tools like ChatGPT can deliver on this promise. In this paper, we show that smaller, fine-tuned LLMs (still) consistently and significantly outperform larger, zero-shot prompted models in text classification. We compare three major generative AI models (ChatGPT with GPT-3.5/GPT-4 and Claude Opus) with several fine-tuned LLMs across a diverse set of classification tasks (sentiment, approval/disapproval, emotions, party positions) and text categories (news, tweets, speeches). We find that fine-tuning with application-specific training data achieves superior performance in all cases. To make this approach more accessible to a broader audience, we provide an easy-to-use toolkit alongside this paper. Our toolkit, accompanied by non-technical step-by-step guidance, enables users to select and fine-tune BERT-like LLMs for any classification task with minimal technical and computational effort.

Paper Structure (18 sections, 4 figures, 4 tables)

This paper contains 18 sections, 4 figures, 4 tables.

Introduction
Related Work and Contribution
Non-technical Background: From Keywords to Large Language Models
Bag-Of-Words
Word Embeddings
Pre-trained Large Language Models
Method: Fine-tuned LLMs vs. Zero-Shot Generative AI Models
Results
Sentiment Analysis on The New York Times Coverage of the US Economy
Stance Classification on Tweets about Kavanaugh Nomination
Emotion Detection on Political Texts in German
Multi-Class Stance Classification on Parties' EU Positions
Fine-Tuning: The Effect of Training Set Size on Model Performance
Discussion
Conclusion
...and 3 more sections

Figures (4)

Figure 1: Overview of existing text-as-data methods and their characteristics: Machine learning approaches for text classification have the potential to combine the advantages of hand-coding (high-quality) and dictionaries (speed), while avoiding their respective downsides. The degree to which this potential can be realized in practice depends on the underlying text representation (see Figure 2).
Figure 2: Text representation of different text-as-data approaches: Existing approaches differ starkly in the sophistication of their text representation. Pre-trained LLMs approximate a more holistic human text understanding that focuses on meaning (concepts) rather than form (wording). This allows LLMs to effectively leverage the information contained in text. By contrast, earlier text-as-data approaches discard significant information due to their more rudimentary language representations.
Figure 3: Schematic representation of the workflow with our toolkit
Figure 4: Effect of training set size on model performance: Results for ROB-LRG with varying number of training observations $N=\{50, 100, 200, 500, 1000\}$. The translucent markers above the 0-point denote the zero-shot results of BART. The rightmost points denote model performance if trained on the full dataset.
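
The training-set-size ablation in Figure 4 can be illustrated with a minimal sketch of how subsets of size N might be drawn before fine-tuning. This is not the authors' code: the function name `training_subsets`, the fixed seed, and the choice of nested simple random samples are assumptions made for illustration, and the sampling procedure actually used in the paper is not described in this section.

```python
import random

# Values of N taken from the Figure 4 caption.
TRAIN_SIZES = [50, 100, 200, 500, 1000]


def training_subsets(examples, sizes=TRAIN_SIZES, seed=42):
    """Return {n: first n shuffled examples} plus the full shuffled dataset.

    Hypothetical helper: subsets are nested (each smaller subset is a prefix
    of every larger one), which is one common way to set up a learning curve.
    """
    rng = random.Random(seed)      # fixed seed so the subsets are reproducible
    shuffled = list(examples)      # copy so the caller's list is left untouched
    rng.shuffle(shuffled)

    subsets = {}
    for n in sizes:
        if n <= len(shuffled):     # skip sizes larger than the dataset
            subsets[n] = shuffled[:n]
    subsets[len(shuffled)] = shuffled  # the "full dataset" point of the curve
    return subsets
```

Each subset would then be passed to the same fine-tuning routine and scored on one held-out test set, so that differences in performance reflect only the number of labeled examples.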
