Generative or Discriminative? Revisiting Text Classification in the Era of Transformers
Siva Rajesh Kasa, Karan Gupta, Sumegh Roychowdhury, Ashutosh Kumar, Yaswanth Biruduraju, Santhosh Kumar Kasa, Nikhil Priyatam Pattisapu, Arindam Bhattacharya, Shailendra Agarwal, Vijay Huddar
TL;DR
This paper reevaluates discriminative versus generative text classifiers in the transformer era through a large-scale comparison, with all models trained from scratch, of encoder-based discriminative models and five generative paradigms: auto-regressive (AR), masked language modeling (MLM), discrete diffusion (DIFF), and pseudo-generative variants. Beyond accuracy, it measures sample efficiency, calibration, ordinality, and noise robustness across nine benchmark datasets and multiple model sizes. The classical two-regime trade-off turns out to be nuanced and architecture-dependent: generative approaches excel in some low-data settings and in settings where robustness to uncertainty matters, small discriminative encoders excel with limited data, and pseudo-generative MLM dominates when labeled data is abundant. Pretraining can erase the two-regime phenomenon, which broadens the practical choices for deployment. The study gives concrete guidance for latency-constrained versus data-rich deployments and treats calibration and ordinality as essential considerations for real-world use.
Abstract
The comparison between discriminative and generative classifiers has intrigued researchers since Efron's seminal analysis of logistic regression versus discriminant analysis. While early theoretical work established that generative classifiers exhibit lower sample complexity but higher asymptotic error in simple linear settings, these trade-offs remain unexplored in the transformer era. We present the first comprehensive evaluation of modern generative and discriminative architectures for text classification: auto-regressive modeling, masked language modeling, discrete diffusion, and encoders. Our study reveals that the classical 'two regimes' phenomenon manifests distinctly across different architectures and training paradigms. Beyond accuracy, we analyze sample efficiency, calibration, noise robustness, and ordinality across diverse scenarios. Our findings offer practical guidance for selecting the most suitable modeling approach based on real-world constraints such as latency and data limitations.
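
To make the generative side of the comparison concrete: a generative classifier models p(x | y) and p(y), then predicts the label that maximizes log p(x | y) + log p(y), whereas a discriminative classifier models p(y | x) directly. The following is a minimal sketch of that decision rule, assuming a smoothed per-class unigram model as a stand-in for the transformer language models studied in the paper. The function names, the toy data, and the smoothing constant `alpha` are illustrative assumptions and are not taken from the paper.

```python
import math
from collections import Counter

def fit_unigram_generative(docs, labels, alpha=1.0):
    """Fit a class prior p(y) and a smoothed unigram model p(token | y) per class.

    Hypothetical stand-in for a class-conditional language model.
    """
    vocab = {tok for doc in docs for tok in doc.split()}
    counts = {y: Counter() for y in set(labels)}
    for doc, y in zip(docs, labels):
        counts[y].update(doc.split())
    prior = Counter(labels)
    model = {}
    for y, c in counts.items():
        # One extra smoothing slot is reserved for tokens unseen in training.
        total = sum(c.values()) + alpha * (len(vocab) + 1)
        model[y] = {
            "log_prior": math.log(prior[y] / len(labels)),
            "log_unk": math.log(alpha / total),
            "log_p": {t: math.log((c[t] + alpha) / total) for t in vocab},
        }
    return model

def generative_predict(model, doc):
    """Return argmax_y [log p(y) + sum_t log p(t | y)] and the per-class scores."""
    scores = {}
    for y, m in model.items():
        log_lik = sum(m["log_p"].get(t, m["log_unk"]) for t in doc.split())
        scores[y] = m["log_prior"] + log_lik
    return max(scores, key=scores.get), scores

# Toy data, invented purely for illustration.
docs = ["great movie loved it", "terrible plot hated it",
        "loved the acting", "hated the ending"]
labels = ["pos", "neg", "pos", "neg"]
model = fit_unigram_generative(docs, labels)
pred, scores = generative_predict(model, "loved it")
print(pred, scores)
```

In this sketch the per-class score is a joint log-probability, so classification costs one likelihood evaluation per candidate label. That is one plausible reason latency appears among the deployment constraints mentioned above, although the sketch says nothing about how the paper's models are actually served.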
