Fine-Tuned 'Small' LLMs (Still) Significantly Outperform Zero-Shot Generative AI Models in Text Classification

Martin Juan José Bucher, Marco Martini

TL;DR

The paper asks whether zero-shot generative AI can replace fine-tuning of small LLMs for text classification. It systematically compares fine-tuned RoBERTa-family models with zero-shot prompting of GPT-3.5/4, Claude Opus, and BART across sentiment, stance, and emotion tasks in English and German, and includes an ablation on training set size. The main finding is that fine-tuned small LLMs consistently outperform zero-shot large models, especially on non-standard tasks, and that roughly 200-500 labeled examples are already enough for meaningful performance. An easy-to-use Hugging Face–based toolkit accompanies the paper; it enables non-experts to fine-tune LLMs for classification tasks and highlights the practical advantages of smaller models for privacy and control in production settings.

Abstract

Generative AI offers a simple, prompt-based alternative to fine-tuning smaller BERT-style LLMs for text classification tasks. This promises to eliminate the need for manually labeled training data and task-specific model training. However, it remains an open question whether tools like ChatGPT can deliver on this promise. In this paper, we show that smaller, fine-tuned LLMs (still) consistently and significantly outperform larger, zero-shot prompted models in text classification. We compare three major generative AI models (ChatGPT with GPT-3.5/GPT-4 and Claude Opus) with several fine-tuned LLMs across a diverse set of classification tasks (sentiment, approval/disapproval, emotions, party positions) and text categories (news, tweets, speeches). We find that fine-tuning with application-specific training data achieves superior performance in all cases. To make this approach more accessible to a broader audience, we provide an easy-to-use toolkit alongside this paper. Our toolkit, accompanied by non-technical step-by-step guidance, enables users to select and fine-tune BERT-like LLMs for any classification task with minimal technical and computational effort.

Paper Structure (18 sections, 4 figures, 4 tables)

This paper contains 18 sections, 4 figures, 4 tables.

Figures (4)

  • Figure 1: Overview of existing text-as-data methods and their characteristics: Machine learning approaches for text classification have the potential to combine the advantages of hand-coding (high-quality) and dictionaries (speed), while avoiding their respective downsides. The degree to which this potential can be realized in practice depends on the underlying text representation (see Figure 2).
  • Figure 2: Text representation of different text-as-data approaches: Existing approaches differ starkly in the sophistication of their text representation. Pre-trained LLMs approximate a more holistic human text understanding that focuses on meaning (concepts) rather than form (wording). This allows LLMs to effectively leverage the information contained in text. By contrast, earlier text-as-data approaches discard significant information due to their more rudimentary language representations.
  • Figure 3: Schematic representation of the workflow with our toolkit
  • Figure 4: Effect of training set size on model performance: Results for ROB-LRG with a varying number of training observations, $N \in \{50, 100, 200, 500, 1000\}$. The translucent markers above the 0-point denote the zero-shot results of BART. The rightmost points denote model performance if trained on the full dataset.
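
The training-set-size ablation described for Figure 4 can be illustrated with a minimal sketch of how nested evaluation subsets might be drawn from a labeled dataset. This is not the authors' toolkit: the function names `stratified_subsample` and `training_subsets` are hypothetical, the data is a toy placeholder, and stratifying by label is an assumption, since the caption does not say how the subsets were sampled. Only the sizes 50, 100, 200, 500, and 1000 come from the caption.

```python
import random
from collections import defaultdict


def stratified_subsample(examples, n, seed=0):
    """Draw roughly n (text, label) pairs, keeping label proportions about intact.

    Hypothetical helper: per-label counts are rounded, so the result can
    differ from n by a few items when proportions do not divide evenly.
    """
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for text, label in examples:
        by_label[label].append((text, label))

    total = len(examples)
    sample = []
    for label, items in sorted(by_label.items()):
        # Each label gets its proportional share, with at least one example.
        k = max(1, round(n * len(items) / total))
        sample.extend(rng.sample(items, min(k, len(items))))
    rng.shuffle(sample)
    return sample


def training_subsets(examples, sizes=(50, 100, 200, 500, 1000), seed=0):
    """Build one training subset per size; sizes larger than the data are skipped."""
    return {
        n: stratified_subsample(examples, n, seed)
        for n in sizes
        if n <= len(examples)
    }


# Toy placeholder data (assumption): 2000 examples, one quarter labeled "positive".
toy = [(f"text {i}", "positive" if i % 4 == 0 else "negative") for i in range(2000)]
subsets = training_subsets(toy)
```

Each subset would then be used to fine-tune the same model, and the resulting scores plotted against the subset size, with the zero-shot baseline shown at size 0.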