A thorough benchmark of automatic text classification: From traditional approaches to large language models

Washington Cunha, Leonardo Rocha, Marcos André Gonçalves

TL;DR

This study provides a cost-benefit benchmark for automatic text classification (ATC), evaluating twelve traditional and recent ATC solutions, including five open LLMs, across 22 real-world datasets. It uses cross-validation with Macro-F1 as the primary metric and analyzes effectiveness, computational cost (training and prediction time), and carbon emissions. LLMs deliver the best effectiveness (gains over traditional methods of about 7.1% on average and up to 26%), but at substantial cost: they are about 590x slower than traditional methods and about 8.5x slower than SLMs, with high CO2e emissions (~961 kg total). The authors offer practical guidance: use LLMs when maximum effectiveness justifies the cost, rely on traditional models in cost-constrained scenarios, and choose SLMs such as RoBERTa for a strong effectiveness-efficiency balance. They also release a reproducible, open benchmark that the community can extend in future work.
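Since Macro-F1 is the primary metric above, a minimal sketch of how it is computed may help. The function name `macro_f1` and the toy labels are illustrative assumptions, not taken from the paper's released code; the sketch only follows the standard definition (the unweighted mean of per-class F1 scores).

```python
from collections import Counter

def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all labels seen in either list."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1          # correct prediction for class t
        else:
            fp[p] += 1          # p was predicted but is wrong
            fn[t] += 1          # t was the true class but was missed
    scores = []
    for c in labels:
        denom = 2 * tp[c] + fp[c] + fn[c]
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class has no TP, FP, or FN
        scores.append(2 * tp[c] / denom if denom else 0.0)
    return sum(scores) / len(scores)

# Hypothetical imbalanced toy data: a classifier that always predicts "pos".
y_true = ["pos", "pos", "pos", "pos", "neg"]
y_pred = ["pos", "pos", "pos", "pos", "pos"]
score = macro_f1(y_true, y_pred)
print(round(score, 4))
```

In this toy case accuracy is 0.8, yet the "neg" class contributes an F1 of 0, so the Macro-F1 is much lower. Every class counts equally regardless of its size, which is the usual reason for preferring the metric on skewed label distributions.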

Abstract

Automatic text classification (ATC) has experienced remarkable advancements in the past decade, best exemplified by recent small and large language models (SLMs and LLMs) built on Transformer architectures. Despite these effectiveness improvements, the literature still lacks a comprehensive cost-benefit analysis of whether the effectiveness gains of these recent approaches compensate for their much higher costs when compared to more traditional text classification approaches such as SVMs and Logistic Regression. In this context, this work's main contributions are twofold: (i) a scientifically sound comparative analysis of the cost-benefit of twelve traditional and recent ATC solutions, including five open LLMs, and (ii) a large benchmark comprising 22 datasets, covering sentiment analysis and topic classification, with their (train-validation-test) partitions based on folded cross-validation procedures, along with documentation and code. The release of code, data, and documentation enables the community to replicate experiments and advance the field in a more scientifically sound manner. Our comparative experimental results indicate that LLMs outperform traditional approaches (by up to 26%, and 7.1% on average) and SLMs (by up to 4.9%, and 1.9% on average) in terms of effectiveness. However, LLMs incur significantly higher computational costs due to fine-tuning, being on average 590x and 8.5x slower than traditional methods and SLMs, respectively. The results suggest the following recommendations: (1) LLMs for applications that require the best possible effectiveness and can afford the costs; (2) traditional methods such as Logistic Regression and SVM for resource-limited applications or those that cannot afford the cost of tuning LLMs; and (3) SLMs like RoBERTa for a near-optimal effectiveness-efficiency trade-off.

Paper Structure

This paper contains 11 sections, 2 figures, 3 tables.

Figures (2)

  • Figure 1: CO2e: the equivalent amount of carbon dioxide (in kg) generated by the classification models’ fine-tuning.
  • Figure 2: Cost (log2 training time) vs. effectiveness (Macro-F1) trade-off for each dataset.