Do BERT-Like Bidirectional Models Still Perform Better on Text Classification in the Era of LLMs?

Junyan Zhang, Yiming Huang, Shuliang Liu, Yubo Gao, Xuming Hu

TL;DR

The paper questions the wholesale shift to LLMs for text classification and systematically compares three families of methods (fine-tuning BERT-like models, using LLM internal states, and zero-shot inference) across six challenging datasets. It finds that BERT-like models often outperform LLMs while requiring far less compute, whereas LLM-based approaches excel in knowledge-intensive tasks and tasks that require deep semantic understanding, such as truth-conditional evaluation. Using PCA visualization and probing, the authors group the datasets into three types and propose TaMAS, a task-aware strategy that prescribes when to use BERT-like models and when to use LLMs. This task-driven approach challenges the one-size-fits-all LLM paradigm and supports efficient model selection in practice, particularly for deployments where compute and latency are constrained.

Abstract

The rapid adoption of LLMs has overshadowed the potential advantages of traditional BERT-like models in text classification. This study challenges the prevailing "LLM-centric" trend by systematically comparing three categories of methods, i.e., fine-tuning BERT-like models, utilizing LLM internal states, and zero-shot inference, across six high-difficulty datasets. Our findings reveal that BERT-like models often outperform LLMs. We further categorize datasets into three types, perform PCA and probing experiments, and identify task-specific model strengths: BERT-like models excel in pattern-driven tasks, while LLMs dominate those requiring deep semantics or world knowledge. Based on this, we propose TaMAS, a fine-grained task selection strategy, advocating for a nuanced, task-driven approach over a one-size-fits-all reliance on LLMs.

Paper Structure

This paper contains 21 sections, 2 equations, 3 figures, 4 tables.

Figures (3)

  • Figure 1: Illustration of our fine-grained task selection strategy TaMAS.
  • Figure 2: Comparative PCA visualization of hidden states across six datasets: BERT-like models vs. LLMs. T-BASE, T-EMOJI, T-HOMO, LAWS, CODE, and HAL refer to the ToxiCloakCNBase, ToxiCloakCNEmoji, ToxiCloakCNHomo, LegalText, MaliciousCode, and Hallucination datasets, respectively.
  • Figure 3: Comparative visualization of hidden-state classification separability using single linear probes on all datasets: BERT-like models vs. LLMs. The fundamental difference in how BERT-like models and LLMs process information becomes particularly evident in the layerwise progression of separability.
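
The layerwise linear probing described in the Figure 3 caption can be illustrated with a minimal sketch. This is not the authors' code: it assumes a probe is a single logistic-regression layer trained on frozen features, and it replaces real hidden states with synthetic Gaussian vectors whose class separation is set by hand per "layer". The function names (`make_layer_features`, `train_linear_probe`, `probe_accuracy`) and the separation values are hypothetical and chosen only for illustration.

```python
import math
import random


def make_layer_features(separation, n_per_class, dim, rng):
    """Synthetic stand-in for one layer's hidden states (an assumption, not real model output).

    Class 0 is centred at -separation/2 and class 1 at +separation/2 along the
    first dimension; all other dimensions are pure unit-variance noise.
    """
    features, labels = [], []
    for label in (0, 1):
        shift = separation / 2.0 if label == 1 else -separation / 2.0
        for _ in range(n_per_class):
            vec = [rng.gauss(0.0, 1.0) for _ in range(dim)]
            vec[0] += shift
            features.append(vec)
            labels.append(label)
    return features, labels


def train_linear_probe(features, labels, epochs=200, lr=0.5):
    """Fit a single linear layer (logistic regression) with batch gradient descent."""
    dim = len(features[0])
    n = len(features)
    weights, bias = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in zip(features, labels):
            z = sum(w * xi for w, xi in zip(weights, x)) + bias
            z = max(-30.0, min(30.0, z))  # clamp to keep exp() well-behaved
            err = 1.0 / (1.0 + math.exp(-z)) - y
            for j in range(dim):
                grad_w[j] += err * x[j]
            grad_b += err
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias


def probe_accuracy(weights, bias, features, labels):
    """Fraction of examples the linear probe classifies correctly."""
    correct = 0
    for x, y in zip(features, labels):
        z = sum(w * xi for w, xi in zip(weights, x)) + bias
        correct += int((z > 0.0) == (y == 1))
    return correct / len(labels)


def layerwise_separability(separations, dim=4, n_per_class=100, seed=0):
    """Train one probe per simulated layer and return held-out accuracy for each."""
    rng = random.Random(seed)
    accuracies = []
    for sep in separations:
        train_x, train_y = make_layer_features(sep, n_per_class, dim, rng)
        test_x, test_y = make_layer_features(sep, n_per_class, dim, rng)
        weights, bias = train_linear_probe(train_x, train_y)
        accuracies.append(probe_accuracy(weights, bias, test_x, test_y))
    return accuracies


if __name__ == "__main__":
    # Hypothetical separations, one per simulated layer; not values from the paper.
    for layer, acc in enumerate(layerwise_separability([0.0, 1.0, 3.0, 6.0])):
        print(f"layer {layer}: probe accuracy = {acc:.2f}")
```

With real models, the synthetic features would be replaced by the hidden states extracted at each layer for every input text, and plotting the per-layer accuracies would give a separability curve of the kind the caption describes. A linear probe is deliberately weak: high accuracy then indicates that the representation itself separates the classes, rather than that the probe learned the task.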