ELECTRA and GPT-4o: Cost-Effective Partners for Sentiment Analysis

James P. Beno

TL;DR

This work investigates a collaborative approach to three-way sentiment classification in which predictions from fine-tuned ELECTRA encoders are fed into GPT-4o family prompts. Using a merged SST-3 and DynaSent dataset, the study systematically evaluates baselines, fine-tuning, and DSPy-driven prompt augmentations. Non-fine-tuned GPT models benefit significantly from ELECTRA-derived context, whereas adding the same context to fine-tuned GPT models decreases performance. GPT-4o FT-M delivers the top performance (86.99 macro F1), with GPT-4o-mini FT close behind (86.70) at a fraction of the cost; the most cost-efficient option is ELECTRA Base FT combined with GPT-4o-mini. The results provide practical guidelines for resource-constrained sentiment analysis projects and show that encoder–LLM collaboration can substantially boost performance with a favorable cost profile. The work suggests future extensions to other domains, datasets, and open-source LLM families, as well as deeper exploration of prompt design and explainability in collaborative setups.

Abstract

Bidirectional transformers excel at sentiment analysis, and Large Language Models (LLMs) are effective zero-shot learners. Might they perform better as a team? This paper explores collaborative approaches between ELECTRA and GPT-4o for three-way sentiment classification. We fine-tuned (FT) four models (ELECTRA Base/Large, GPT-4o/4o-mini) using a mix of reviews from Stanford Sentiment Treebank (SST) and DynaSent. We provided input from ELECTRA to GPT as: predicted label, probabilities, and retrieved examples. Sharing ELECTRA Base FT predictions with GPT-4o-mini significantly improved performance over either model alone (82.50 macro F1 vs. 79.14 ELECTRA Base FT, 79.41 GPT-4o-mini) and yielded the lowest cost/performance ratio (\$0.12/F1 point). However, when GPT models were fine-tuned, including predictions decreased performance. GPT-4o FT-M was the top performer (86.99), with GPT-4o-mini FT close behind (86.70) at much less cost (\$0.38 vs. \$1.59/F1 point). Our results show that augmenting prompts with predictions from fine-tuned encoders is an efficient way to boost performance, and a fine-tuned GPT-4o-mini is nearly as good as GPT-4o FT at 76% less cost. Both are affordable options for projects with limited resources.
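
As a rough illustration of the idea described in the abstract, the sketch below formats a prompt that carries an encoder's predicted label and class probabilities, and checks the quoted cost saving. The prompt wording, the function names `build_augmented_prompt` and `relative_saving`, and the example probabilities are all hypothetical; the paper builds its prompts with DSPy signatures, which this plain-string sketch does not reproduce.

```python
from typing import Mapping

LABELS = ("negative", "neutral", "positive")


def build_augmented_prompt(review: str, probs: Mapping[str, float]) -> str:
    """Format a three-way sentiment prompt that includes an encoder's output.

    `probs` maps each label to a probability. The values are assumed to come
    from a fine-tuned encoder such as ELECTRA; here they are made up.
    """
    # The predicted label is simply the highest-probability class.
    predicted = max(LABELS, key=lambda label: probs[label])
    prob_text = ", ".join(f"{label}: {probs[label]:.2f}" for label in LABELS)
    return (
        "Classify the sentiment of the review as negative, neutral, or positive.\n"
        f"Review: {review}\n"
        f"Classifier prediction: {predicted}\n"
        f"Classifier probabilities: {prob_text}\n"
        "Sentiment:"
    )


def relative_saving(cheaper: float, pricier: float) -> float:
    """Fractional cost reduction of `cheaper` relative to `pricier`."""
    return 1.0 - cheaper / pricier


# Hypothetical encoder output for a single review.
prompt = build_augmented_prompt(
    "The plot drags, but the cast is great.",
    {"negative": 0.20, "neutral": 0.50, "positive": 0.30},
)

# Cost per F1 point quoted in the abstract: $0.38 (GPT-4o-mini FT) vs. $1.59 (GPT-4o FT-M).
saving = relative_saving(0.38, 1.59)
```

With the figures from the abstract, `saving` rounds to 0.76, which matches the stated 76% cost reduction. The resulting `prompt` string could then be sent to any chat model; the API call is omitted because it is not specified in this section.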

Paper Structure (21 sections, 11 figures, 13 tables)

This paper contains 21 sections, 11 figures, 13 tables.

Figures (11)

  • Figure 1: Change in Mean F1 from Adding Predictions
  • Figure 2: Mean Macro F1 Scores on Merged Dataset by Experiment
  • Figure 3: Impact of Context on Mean F1 Score
  • Figure 4: Incorrect Predictions by Label (Round 1)
  • Figure 5: Basic Prompt DSPy Signature
  • ...and 6 more figures