Generative or Discriminative? Revisiting Text Classification in the Era of Transformers

Siva Rajesh Kasa, Karan Gupta, Sumegh Roychowdhury, Ashutosh Kumar, Yaswanth Biruduraju, Santhosh Kumar Kasa, Nikhil Priyatam Pattisapu, Arindam Bhattacharya, Shailendra Agarwal, Vijay Huddar

TL;DR

This paper reevaluates discriminative versus generative text classifiers in the transformer era through a large-scale comparison, with all models trained from scratch, of encoder-based discriminative models and several generative paradigms: auto-regressive (AR), masked language modeling (MLM), discrete diffusion (DIFF), and pseudo-generative variants. Beyond accuracy, it measures sample efficiency, calibration, ordinality, and noise robustness across nine benchmark datasets and multiple model sizes. The results show that the classical two-regime trade-off is nuanced and architecture-dependent: generative approaches excel in some low-data or uncertainty-robust contexts, while pseudo-generative MLM thrives with ample data. Key findings include that small discriminative encoders excel with limited data, that pseudo-generative MLM dominates when abundant labeled data is available, and that pretraining can erase the two-regime phenomenon, broadening the practical choices for deployment. The study provides concrete guidance for latency-constrained versus data-rich deployments and emphasizes calibration and ordinality as essential considerations for real-world use.

Abstract

The comparison between discriminative and generative classifiers has intrigued researchers since Efron's seminal analysis of logistic regression versus discriminant analysis. While early theoretical work established that generative classifiers exhibit lower sample complexity but higher asymptotic error in simple linear settings, these trade-offs remain unexplored in the transformer era. We present the first comprehensive evaluation of modern generative and discriminative architectures for text classification: Auto-regressive modeling, Masked Language Modeling, Discrete Diffusion, and Encoders. Our study reveals that the classical 'two regimes' phenomenon manifests distinctly across different architectures and training paradigms. Beyond accuracy, we analyze sample efficiency, calibration, noise robustness, and ordinality across diverse scenarios. Our findings offer practical guidance for selecting the most suitable modeling approach based on real-world constraints such as latency and data limitations.

Paper Structure

This paper contains 20 sections, 11 equations, 14 figures, 6 tables.

Figures (14)

  • Figure 1: [Best viewed in color] Illustration of different modeling paradigms (ENC: Encoder-based classification, MLM: Masked Language Modeling, AR: Auto-Regressive Model, DIFF: Discrete Text Diffusion).
  • Figure 2: [Best viewed in color] Comparison of weighted-F1 scores of models across different configurations ($\uparrow$ is better). For the rest of the datasets, refer to Figure \ref{fig:combined_plots_all_data} in Appendix \ref{app:main_results}. (X-axis: sample size, Y-axis: weighted-F1 score)
  • Figure 2: Minimum noise% needed for X% weighted-F1 drop from the peak under Random Token Dropping. ($\uparrow$ is better)
  • Figure 3: [Best viewed in color] Comparison of weighted-F1 scores between AR$_{pseudo}$ and AR ($\uparrow$ is better). 1-layer results are omitted here as they are mostly trivial in low-data settings. Results for remaining datasets are provided in Figure \ref{fig:combined_plots_gpts_all_data}, Appendix \ref{app:main_results}. (X-axis: sample size, Y-axis: weighted-F1 score)
  • Figure 4: [Best viewed in color] Calibration and ordinal performance of the 12-layer model on SST-5. For ECE, MCE, MAE, and MSE, $\downarrow$ is better; for UM, $\uparrow$ is better. (X-axis: sample size)
  • ...and 9 more figures
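
The calibration metrics named in the Figure 4 caption (ECE and MCE) are not defined in this summary. As a rough illustration only, the sketch below computes them with the common equal-width binning scheme: predictions are grouped by confidence, and each bin's accuracy is compared with its mean confidence. The function name `calibration_errors`, the choice of 10 bins, and the binning scheme itself are assumptions made for this example; the paper may use a different setup.

```python
import numpy as np

def calibration_errors(confidences, correct, n_bins=10):
    """Return (ECE, MCE) using equal-width confidence bins.

    confidences: predicted probability of the chosen class, in [0, 1].
    correct:     1 if the prediction was right, else 0.
    n_bins:      number of equal-width bins (assumed here, not from the paper).
    """
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)

    # Interior bin edges; digitize maps each confidence to a bin index 0..n_bins-1.
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    bin_ids = np.digitize(confidences, edges[1:-1], right=True)

    ece, mce = 0.0, 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if not mask.any():
            continue  # empty bins contribute nothing
        # Gap between empirical accuracy and mean confidence in this bin.
        gap = abs(correct[mask].mean() - confidences[mask].mean())
        ece += mask.mean() * gap   # weight the gap by the fraction of samples in the bin
        mce = max(mce, gap)        # MCE is the worst bin-level gap
    return ece, mce

# Toy usage: four predictions at 0.9 confidence, three of them correct.
ece, mce = calibration_errors([0.9, 0.9, 0.9, 0.9], [1, 1, 1, 0])
print(f"ECE={ece:.3f}  MCE={mce:.3f}")
```

In the toy call, all predictions fall into a single bin with accuracy 0.75 and mean confidence 0.9, so both metrics equal the gap between the two. Lower values indicate better calibration, matching the $\downarrow$ convention in the caption.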