Generative and Discriminative Text Classification with Recurrent Neural Networks
Dani Yogatama, Chris Dyer, Wang Ling, Phil Blunsom
TL;DR
This work compares discriminative and generative LSTM architectures for text classification, focusing on sample efficiency and robustness to distribution shift. The authors implement a discriminative model that maximizes $p(y\mid \boldsymbol{x})$ and generative models that maximize $p(\boldsymbol{x},y)=p(\boldsymbol{x}\mid y)p(y)$, including a Shared LSTM and an Independent LSTM variant, with a unified base encoder. Empirical results show that the discriminative model attains lower asymptotic error, whereas the generative models have higher asymptotic error but approach it with fewer training examples. The generative models also outperform the discriminative model in small-data, continual, and zero-shot settings, indicating better sample efficiency and adaptation. The paper also discusses computational trade-offs, the use of data likelihood to detect distribution shift, and training strategies that allow new classes to be incorporated quickly in continual learning. Overall, the findings extend the pattern that Ng & Jordan proved for linear models to nonlinear LSTMs, and they highlight the practical advantages of generative approaches under shifting data distributions and in low-resource scenarios.
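To make the generative decision rule concrete, here is a minimal Python sketch of classification by $\arg\max_y \log p(\boldsymbol{x}\mid y) + \log p(y)$, together with the marginal log-likelihood $\log p(\boldsymbol{x})$ that could serve as a distribution-shift signal. It is not the authors' implementation: a smoothed class-conditional unigram model stands in for the class-conditional LSTM, and every function name, the `alpha` smoothing parameter, and the toy data are assumptions made for illustration.

```python
import math
from collections import Counter


def fit_class_conditional_unigrams(docs, labels, alpha=1.0):
    """Estimate p(y) and a smoothed unigram p(x | y) for each class.

    Hypothetical stand-in for a class-conditional LSTM language model.
    """
    vocab = {w for doc in docs for w in doc}
    class_counts = Counter(labels)
    word_counts = {y: Counter() for y in class_counts}
    for doc, y in zip(docs, labels):
        word_counts[y].update(doc)
    log_prior = {y: math.log(n / len(labels)) for y, n in class_counts.items()}
    return {"vocab": vocab, "word_counts": word_counts,
            "log_prior": log_prior, "alpha": alpha}


def log_likelihood(model, doc, y):
    """log p(x | y); all unseen words share one extra smoothed slot."""
    counts = model["word_counts"][y]
    alpha = model["alpha"]
    denom = sum(counts.values()) + alpha * (len(model["vocab"]) + 1)
    return sum(math.log((counts[w] + alpha) / denom) for w in doc)


def log_joint(model, doc):
    """log p(x, y) = log p(x | y) + log p(y) for every class y."""
    return {y: log_likelihood(model, doc, y) + lp
            for y, lp in model["log_prior"].items()}


def predict(model, doc):
    """Generative decision rule: the class with the highest joint score."""
    scores = log_joint(model, doc)
    return max(scores, key=scores.get)


def log_evidence(model, doc):
    """log p(x) = logsumexp_y log p(x, y); low values suggest shifted input."""
    scores = list(log_joint(model, doc).values())
    m = max(scores)
    return m + math.log(sum(math.exp(s - m) for s in scores))


# Toy data, invented for this sketch only.
toy_docs = [["good", "great", "film"], ["bad", "boring", "film"],
            ["great", "acting"], ["boring", "plot"]]
toy_labels = ["pos", "neg", "pos", "neg"]
toy_model = fit_class_conditional_unigrams(toy_docs, toy_labels)
print(predict(toy_model, ["great", "film"]))
print(log_evidence(toy_model, ["great", "film"]),
      log_evidence(toy_model, ["zzz", "qqq"]))
```

A discriminative model would instead parameterize $p(y\mid \boldsymbol{x})$ directly and would have no `log_evidence` analogue, which is why the data-likelihood check for shift is specific to the generative side.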
Abstract
We empirically characterize the performance of discriminative and generative LSTM models for text classification. We find that although RNN-based generative models are more powerful than their bag-of-words ancestors (e.g., they account for conditional dependencies across words in a document), they have higher asymptotic error rates than discriminatively trained RNN models. However, we also find that generative models approach their asymptotic error rate more rapidly than their discriminative counterparts, the same pattern that Ng & Jordan (2001) proved holds for linear classification models that make more naive conditional independence assumptions. Building on this finding, we hypothesize that RNN-based generative classification models will be more robust to shifts in the data distribution. This hypothesis is confirmed in a series of experiments in zero-shot and continual learning settings that show that generative models substantially outperform discriminative models.
