Selecting Between BERT and GPT for Text Classification in Political Science Research

Yu Wang, Wen Qu, Xin Ye

TL;DR

While zero-shot and few-shot learning with GPT models provide reasonable performance and are well suited for early-stage research exploration, they generally fall short of, or at best match, the performance of BERT fine-tuning, particularly as the training set reaches a substantial size.

Abstract

Political scientists often grapple with data scarcity in text classification. Recently, fine-tuned BERT models and their variants have gained traction as effective solutions to this issue. In this study, we investigate the potential of GPT-based models combined with prompt engineering as a viable alternative. We conduct a series of experiments across various classification tasks, differing in the number of classes and complexity, to evaluate the effectiveness of BERT-based versus GPT-based models in low-data scenarios. Our findings indicate that while zero-shot and few-shot learning with GPT models provide reasonable performance and are well suited for early-stage research exploration, they generally fall short of, or at best match, the performance of BERT fine-tuning, particularly as the training set reaches a substantial size (e.g., 1,000 samples). We conclude by comparing these approaches in terms of performance, ease of use, and cost, providing practical guidance for researchers facing data limitations. Our results are particularly relevant for those engaged in quantitative text analysis in low-resource settings or with limited labeled data.

Paper Structure

This paper contains 18 sections, 6 figures, 7 tables.

Figures (6)

  • Figure 1: Increasing the number of samples enhances model accuracy, whether through fine-tuning BERT models or prompting GPT models. 'RoBERTa # 200' refers to fine-tuning RoBERTa-large with 200 samples, while 'Temp0.2 #0' indicates zero-shot prompting with a temperature setting of 0.2. The black vertical error bar represents the range from the minimum to the maximum values. Few-shot prompting with two samples performs about the same as fine-tuning RoBERTa-large with 1,000 samples. Fine-tuning yields a higher variance in test evaluations than prompting. A lower temperature setting of 0.2 yields slightly better performance than a higher temperature of 0.8 for prompting.
  • Figure 2: Prompting with or without samples lags behind fine-tuning RoBERTa-large models by a sizeable margin in the 8-class manifesto classification.
  • Figure 3: Fine-tuning BERT models substantially outperforms prompting GPT models in the 8-class New Zealand Parliamentary Speech classification. While fine-tuning continues to show significant improvement with the addition of more training samples, prompting appears to gain no benefit from embedding extra samples into the prompts.
  • Figure 4: Fine-tuning BERT models with 500 samples performs comparably to prompting GPT models in the 20-class COVID-19 policy measure classification task. Because fine-tuning continues to improve significantly as more training samples are added, fine-tuning with 1,000 samples clearly has an edge over prompting, which does not appear to benefit from the additional samples.
  • Figure 5: In the 22-class classification of the US State of the Union speeches, zero-shot prompting outperforms fine-tuning BERT with 200 samples. 1-shot and 2-shot prompting perform similarly to fine-tuning with 500 and 1,000 samples, respectively.
  • ...and 1 more figures
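
The captions above distinguish zero-shot prompting ('#0') from 1-shot and 2-shot prompting, where labeled samples are embedded directly in the prompt. As a rough illustration of that distinction, the sketch below assembles a k-shot classification prompt in Python. The function name `build_k_shot_prompt`, the prompt wording, and the example labels and texts are all hypothetical and are not taken from the paper. The temperature (0.2 or 0.8 in Figure 1) is a sampling parameter supplied to the model when the request is made, so it does not appear in the prompt text.

```python
from typing import List, Sequence, Tuple


def build_k_shot_prompt(
    labels: Sequence[str],
    examples: Sequence[Tuple[str, str]],
    text: str,
) -> str:
    """Assemble a classification prompt with k = len(examples) labeled samples.

    The wording is an illustrative assumption, not the authors' prompt.
    An empty `examples` sequence yields a zero-shot prompt.
    """
    lines: List[str] = [
        "Classify the text into one of these categories: "
        + ", ".join(labels)
        + "."
    ]
    # Each labeled sample becomes a Text/Category pair shown to the model.
    for example_text, example_label in examples:
        lines.append(f"Text: {example_text}")
        lines.append(f"Category: {example_label}")
    # The item to classify comes last, with the category left blank.
    lines.append(f"Text: {text}")
    lines.append("Category:")
    return "\n".join(lines)


# Hypothetical labels and samples, for illustration only.
LABELS = ["economy", "health"]
SHOTS = [
    ("We will lower income taxes.", "economy"),
    ("Hospitals need more nurses.", "health"),
]

zero_shot_prompt = build_k_shot_prompt(LABELS, [], "Vaccines should be free.")
two_shot_prompt = build_k_shot_prompt(LABELS, SHOTS, "Vaccines should be free.")
```

In this sketch, moving from zero-shot to 2-shot only changes how many Text/Category pairs precede the final item; no model weights are updated, which is the practical contrast with fine-tuning that the figures compare.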