Text Classification in the LLM Era -- Where do we stand?

Sowmya Vajjala, Shwetali Shimangaud

TL;DR

This paper systematically benchmarks text classification methods in the LLM era across 32 datasets in 8 languages, comparing zero-shot prompting, few-shot fine-tuning, synthetic data generation, and full-data baselines. It employs multiple LLM sources (open and proprietary) and analyzes real vs. synthetic training data, revealing that zero-shot methods are strong for sentiment but underperform for more complex tasks, while synthetic data from multiple LLMs often matches or surpasses zero-shot performance and can rival real data in certain settings. Few-shot fine-tuning generally provides gains over zero-shot, especially as label counts grow, though sentiment classification may not benefit as much; supervised fine-tuning remains the strongest option when data is abundant, albeit with higher compute costs. The findings offer practical guidance for practitioners on method selection across languages, highlighting substantial cross-language disparities and the potential of leveraging multi-source synthetic data to balance performance, cost, and energy use.

Abstract

Large Language Models revolutionized NLP and showed dramatic performance improvements across several tasks. In this paper, we investigated the role of such language models in text classification and how they compare with other approaches relying on smaller pre-trained language models. Considering 32 datasets spanning 8 languages, we compared zero-shot classification, few-shot fine-tuning and synthetic data based classifiers with classifiers built using the complete human labeled dataset. Our results show that zero-shot approaches do well for sentiment classification, but are outperformed by other approaches for the rest of the tasks, and synthetic data sourced from multiple LLMs can build better classifiers than zero-shot open LLMs. We also see wide performance disparities across languages in all the classification scenarios. We expect that these findings would guide practitioners working on developing text classification systems across languages.

Paper Structure

This paper contains 34 sections, 7 figures, 9 tables.

Figures (7)

  • Figure 1: Zero-shot LLMs versus a logistic regression classifier trained with full data
  • Figure 2: Few-shot Fine-tuning
  • Figure 3: Zero-shot versus synthetic data based Classification
  • Figure 4: Zero-shot GPT4, synthetic data based, and real data based classifiers
  • Figure 5: Synthetic versus Real-Data
  • ...and 2 more figures