Text Classification in the LLM Era -- Where do we stand?
Sowmya Vajjala, Shwetali Shimangaud
TL;DR
This paper systematically benchmarks text classification methods in the LLM era across 32 datasets in 8 languages, comparing zero-shot prompting, few-shot fine-tuning, synthetic data generation, and full-data baselines. It employs multiple LLM sources (open and proprietary) and analyzes real vs. synthetic training data, revealing that zero-shot methods are strong for sentiment but underperform on more complex tasks, while synthetic data from multiple LLMs often matches or surpasses zero-shot performance and can rival real data in certain settings. Few-shot fine-tuning generally provides gains over zero-shot, especially as label counts grow, though sentiment classification may not benefit as much; supervised fine-tuning remains the strongest option when data is abundant, albeit with higher compute costs. The findings offer practical guidance for practitioners on method selection across languages, highlighting substantial cross-language disparities and the potential of leveraging multi-source synthetic data to balance performance, cost, and energy use.
Abstract
Large Language Models revolutionized NLP and showed dramatic performance improvements across several tasks. In this paper, we investigated the role of such language models in text classification and how they compare with other approaches relying on smaller pre-trained language models. Considering 32 datasets spanning 8 languages, we compared zero-shot classification, few-shot fine-tuning, and synthetic-data-based classifiers with classifiers built using the complete human-labeled dataset. Our results show that zero-shot approaches do well for sentiment classification but are outperformed by other approaches on the rest of the tasks, and that synthetic data sourced from multiple LLMs can build better classifiers than zero-shot open LLMs. We also see wide performance disparities across languages in all the classification scenarios. We expect that these findings will guide practitioners developing text classification systems across languages.
