Text Classification: Neural Networks VS Machine Learning Models VS Pre-trained Models
Christos Petridis
TL;DR
This study benchmarks text classification across three paradigms: pre-trained transformer models, standard neural networks, and traditional machine learning algorithms, with the latter two evaluated using both TF-IDF and GloVe embeddings. It shows that pre-trained transformers (e.g., BERT, RoBERTa, XLM-RoBERTa) consistently outperform the other approaches, especially on the level-1 task, while the level-2 task remains harder because it has more classes. Embedding choice is crucial for non-transformer models, with GloVe providing clear gains over TF-IDF; nonetheless, traditional methods lag behind fine-tuned transformers. The work also highlights the practicality of transfer learning, notes some anomalies (e.g., ALBERT on level-2), and discusses the trade-offs between model size, speed, and accuracy that matter for deployment decisions.
Abstract
Text classification is a common task, and many efficient methods and algorithms are available to accomplish it. Transformers have revolutionized the field of deep learning, particularly in Natural Language Processing (NLP), and have rapidly expanded to other domains such as computer vision and time-series analysis. The transformer model was first introduced in the context of machine translation, and its architecture relies on self-attention mechanisms to capture complex relationships within data sequences. It handles long-range dependencies more effectively than traditional neural networks such as Recurrent Neural Networks and Multilayer Perceptrons. In this work, we compare different techniques for text classification. We consider seven pre-trained models, three standard neural networks and three machine learning models. For the standard neural networks and machine learning models, we also compare two embedding techniques, TF-IDF and GloVe, with the latter consistently outperforming the former. Finally, we present the results of our experiments, in which pre-trained models such as BERT and DistilBERT always perform better than the standard models and algorithms.
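To make the self-attention mechanism mentioned above concrete, the following is a minimal sketch of scaled dot-product self-attention in plain Python. It is an illustration only, not the implementation used in this study: it assumes a single attention head and omits the learned query, key and value projections (so Q = K = V = the input), and the function names and the toy input vectors are hypothetical.

```python
import math


def softmax(scores):
    # Subtract the maximum score for numerical stability before exponentiating.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(x):
    """Scaled dot-product self-attention over a sequence of token vectors.

    Simplifying assumption: no learned projections, so queries, keys and
    values are all the raw input vectors.
    """
    d = len(x[0])
    scale = math.sqrt(d)
    outputs, weights = [], []
    for q in x:
        # Similarity of this token (query) to every token in the sequence (keys).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / scale for k in x]
        w = softmax(scores)
        weights.append(w)
        # The output is the attention-weighted average of the value vectors.
        outputs.append([sum(wj * v[i] for wj, v in zip(w, x)) for i in range(d)])
    return outputs, weights


# Toy sequence of three 2-dimensional token vectors (made-up values).
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out, attn = self_attention(tokens)
```

Each row of `attn` is a probability distribution over all positions in the sequence, which is why every token can draw on any other token directly, regardless of how far apart they are.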
