Large Language Models For Text Classification: Case Study And Comprehensive Review

Arina Kostina, Marios D. Dikaiakos, Dimosthenis Stefanidis, George Pallis

TL;DR

The paper evaluates large language models (LLMs) for text classification on two real-world tasks (multiclass classification of employee working locations and binary fake-news detection) against RoBERTa and traditional ML baselines. It systematically analyzes prompting strategies (zero-shot, few-shot, chain-of-thought, emotional prompting, role-playing, etc.) and measures weighted F1-score alongside inference time to assess practicality. Llama3-70B frequently achieves the top score, particularly in the multiclass setting, but at higher latency. RoBERTa offers strong accuracy with much faster inference, and Naive Bayes (NB) and SVM remain competitive baselines for the binary task. The results highlight the significant impact of prompting, indicate that quantization can preserve efficiency for some models, and argue for task- and latency-aware model selection. The study points to future work with broader datasets, more prompting configurations, and cross-domain applications to better understand what drives performance differences across models and tasks.

Abstract

Unlocking the potential of Large Language Models (LLMs) in data classification represents a promising frontier in natural language processing. In this work, we evaluate the performance of different LLMs in comparison with state-of-the-art deep-learning and machine-learning models, in two different classification scenarios: i) the classification of employees' working locations based on job reviews posted online (multiclass classification), and ii) the classification of news articles as fake or not (binary classification). Our analysis encompasses a diverse range of language models differing in size, quantization, and architecture. We explore the impact of alternative prompting techniques and evaluate the models based on the weighted F1-score. We also examine the trade-off between performance (F1-score) and time (inference response time) for each language model to provide a more nuanced understanding of each model's practical applicability. Our work reveals significant variations in model responses based on the prompting strategies. We find that LLMs, particularly Llama3 and GPT-4, can outperform traditional methods in complex classification tasks, such as multiclass classification, though at the cost of longer inference times. In contrast, simpler ML models offer better performance-to-time trade-offs in simpler binary classification tasks.
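
The abstract names the weighted F1-score as the evaluation metric. As a minimal sketch of the standard definition (per-class F1 averaged with weights proportional to each class's share of the true labels), the following may help; the function name and the toy labels are illustrative assumptions, not taken from the paper.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Support-weighted mean of per-class F1 scores (standard definition).

    Hypothetical helper for illustration; not the authors' evaluation code.
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")

    support = Counter(y_true)          # number of true samples per class
    total = len(y_true)
    score = 0.0

    for label, n_true in support.items():
        # True positives: predicted `label` and actually `label`.
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        n_pred = sum(1 for p in y_pred if p == label)

        precision = tp / n_pred if n_pred else 0.0
        recall = tp / n_true
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0

        # Each class contributes in proportion to its share of true labels.
        score += (n_true / total) * f1

    return score


# Toy example (made-up labels, not data from the paper).
y_true = ["fake", "fake", "real", "real"]
y_pred = ["fake", "real", "real", "real"]
print(round(weighted_f1(y_true, y_pred), 4))
```

Weighting by support means that frequent classes dominate the score, which matters for a multiclass task where class sizes may be uneven.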

Paper Structure (23 sections, 6 figures, 2 tables)

This paper contains 23 sections, 6 figures, and 2 tables.

Figures (6)

  • Figure 1: Example Prompt Construction: The base instruction combines with a prompting technique, then is wrapped in a ZS or FS setting, which forms the final prompt sent to the LLM.
  • Figure 2: Boxplot of F1 Score % change that Each Prompt Caused Compared to Basic ZS (FakeNewsNet Dataset)
  • Figure 3: Boxplot of Performance Range for Each Model (FakeNewsNet Dataset)
  • Figure 4: Boxplot of Performance Range for Each Model (Employee Reviews Dataset)
  • Figure 5: Boxplot of F1 Score % change that Each Prompt Caused Compared to Basic ZS (Employee Reviews Dataset)
  • ...and 1 more figure
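
The Figure 1 caption describes prompts as a base instruction combined with a prompting technique and then wrapped in a zero-shot (ZS) or few-shot (FS) setting, and Figures 2 and 5 report the F1 change of each prompt relative to basic ZS. The sketch below illustrates one way this could look. The function names, the `Text:`/`Label:` template, the ordering of instruction and technique text, and the example strings are all assumptions made for illustration; the paper's actual templates are not shown here.

```python
def build_prompt(base_instruction, text, technique_text="", examples=None):
    """Assemble a classification prompt (hypothetical template).

    base_instruction: the task description.
    technique_text:   optional wording for a prompting technique
                      (e.g. a role-playing or chain-of-thought sentence).
    examples:         optional list of (text, label) pairs; an empty or
                      missing list gives a zero-shot prompt, a non-empty
                      list gives a few-shot prompt.
    """
    # Step 1: combine the base instruction with the technique wording.
    instruction = " ".join(s for s in (base_instruction, technique_text) if s)

    # Step 2: wrap in a ZS or FS setting by optionally adding labeled examples.
    blocks = [instruction]
    for ex_text, ex_label in examples or []:
        blocks.append(f"Text: {ex_text}\nLabel: {ex_label}")

    # Step 3: append the item to classify, leaving the label open.
    blocks.append(f"Text: {text}\nLabel:")
    return "\n\n".join(blocks)


def pct_change(f1, baseline_f1):
    """Percentage change of an F1 score relative to a baseline score."""
    if baseline_f1 == 0:
        raise ValueError("baseline_f1 must be non-zero")
    return 100.0 * (f1 - baseline_f1) / baseline_f1


# Illustrative usage with made-up strings.
zero_shot = build_prompt(
    "Classify the news article as fake or real.",
    "Aliens land in Paris.",
)
few_shot = build_prompt(
    "Classify the news article as fake or real.",
    "Aliens land in Paris.",
    technique_text="Think step by step.",
    examples=[("The city council approved the budget.", "real")],
)
print(few_shot)
```

Keeping the technique wording and the ZS/FS wrapping as separate arguments lets every technique be tested in both settings with the same base instruction.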