Generating Effective Ensembles for Sentiment Analysis

Itay Etelis, Avi Rosenfeld, Abraham Itzhak Weinberg, David Sarne

TL;DR

The paper investigates how to push sentiment analysis (SA) performance beyond transformer-only ensembles by incorporating heterogeneous base-learners from the lexicon-based, bag-of-words, CNN, and transformer families. It introduces the Hierarchical Ensemble Construction (HEC) algorithm, a greedy, simulated-annealing-based method that selects small, complementary subsets of base-learners (3–6 models) from a large pool and aggregates their predictions by weighted voting. Across eight canonical SA datasets, HEC consistently outperforms traditional ensemble methods (Weighted Majority Voting (WMV), Stacking, Shapley, Bayesian Networks) as well as transformer-only ensembles, reaching a mean accuracy of 95.71% and closing more of the remaining gap to perfect accuracy than the other approaches. Compared with GPT-4, HEC is more accurate on six of the eight datasets, while GPT-4 is more accurate on the remaining two. These results underscore the practical value of carefully constructed heterogeneous ensembles for robust SA and suggest broader applicability to other NLP tasks.
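The summary names two ingredients: a simulated-annealing search over small subsets of base-learners, and weighted-vote aggregation. Both can be illustrated with a minimal Python sketch. This is not the authors' HEC algorithm, and its hierarchical, greedy construction is not reproduced here.

The sketch makes several assumptions:

- Labels are binary (0/1).
- Each model's voting weight is its own accuracy.
- Neighbouring subsets are produced by a hypothetical move that adds, removes, or swaps one model.
- Temperature follows a geometric cooling schedule.

The function names `accuracy`, `weighted_vote`, and `anneal_ensemble` and all parameter defaults are illustrative choices.

```python
import math
import random
from collections.abc import Sequence


def accuracy(pred: Sequence[int], labels: Sequence[int]) -> float:
    """Fraction of positions where the prediction matches the label."""
    return sum(p == y for p, y in zip(pred, labels)) / len(labels)


def weighted_vote(preds: Sequence[Sequence[int]], weights: Sequence[float]) -> list[int]:
    """Combine binary (0/1) predictions: each model adds +w for class 1, -w for class 0.

    Ties (combined score exactly 0) resolve to class 0.
    """
    combined = []
    for i in range(len(preds[0])):
        score = sum(w if p[i] == 1 else -w for p, w in zip(preds, weights))
        combined.append(1 if score > 0 else 0)
    return combined


def anneal_ensemble(preds, labels, min_size=3, max_size=6,
                    steps=500, t0=0.05, cooling=0.99, seed=0):
    """Hypothetical simulated-annealing search for a subset of models (indices
    into `preds`) whose weighted vote maximizes accuracy on `labels`."""
    rng = random.Random(seed)
    pool = list(range(len(preds)))
    # Assumption: a model's voting weight is its accuracy on the same data.
    weights = [accuracy(p, labels) for p in preds]

    def score(subset):
        members = sorted(subset)  # fixed order keeps floating-point sums reproducible
        return accuracy(weighted_vote([preds[j] for j in members],
                                      [weights[j] for j in members]), labels)

    current = set(rng.sample(pool, min_size))
    cur_score = score(current)
    best, best_score = set(current), cur_score
    temp = t0
    for _ in range(steps):
        cand = set(current)
        outside = [j for j in pool if j not in cand]
        move = rng.choice(["add", "remove", "swap"])
        if move == "add" and len(cand) < max_size and outside:
            cand.add(rng.choice(outside))
        elif move == "remove" and len(cand) > min_size:
            cand.remove(rng.choice(sorted(cand)))
        elif outside:  # swap; also the fallback when add/remove would break size bounds
            cand.remove(rng.choice(sorted(cand)))
            cand.add(rng.choice(outside))
        cand_score = score(cand)
        delta = cand_score - cur_score
        # Always accept improvements; accept worse subsets with probability exp(delta / T).
        if delta >= 0 or rng.random() < math.exp(delta / temp):
            current, cur_score = cand, cand_score
            if cur_score > best_score:
                best, best_score = set(current), cur_score
        temp *= cooling  # geometric cooling
    return sorted(best), best_score


if __name__ == "__main__":
    rng = random.Random(1)
    labels = [rng.randint(0, 1) for _ in range(200)]
    # Eight synthetic "base-learners", each correct with a different probability.
    preds = [[y if rng.random() < q else 1 - y for y in labels]
             for q in (0.9, 0.85, 0.8, 0.75, 0.7, 0.65, 0.6, 0.55)]
    subset, acc = anneal_ensemble(preds, labels)
    print(subset, round(acc, 3))
```

In practice, the subset should be chosen on a validation split and reported on a separate test split. Scoring candidates on the same data used for the final figure would overstate accuracy.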

Abstract

In recent years, transformer models have revolutionized Natural Language Processing (NLP), achieving exceptional results across various tasks, including Sentiment Analysis (SA). Consequently, current state-of-the-art approaches for SA predominantly rely on transformer models alone, achieving impressive accuracy on benchmark datasets. In this paper, we show that the key to further improving the accuracy of SA ensembles is to include not only transformers but also traditional NLP models, even though the latter are individually less accurate than transformers. However, as we empirically show, this requires a change in how the ensemble is constructed, specifically relying on the Hierarchical Ensemble Construction (HEC) algorithm we present. Our empirical studies across eight canonical SA datasets reveal that ensembles incorporating a mix of model types, structured via HEC, significantly outperform traditional ensembles. Finally, we provide a comparative analysis of the performance of HEC and GPT-4, demonstrating that while GPT-4 closely approaches state-of-the-art SA methods, it is still outperformed by our proposed ensemble strategy.

Paper Structure

This paper contains 26 sections, 4 equations, 5 figures, 7 tables, and 1 algorithm.

Figures (5)

  • Figure 1: Comparative performance analysis of Hierarchical Ensemble Construction (HEC) against GPT-4 and other ensemble methods (Weighted Majority Voting (WMV) and Stacking) over eight datasets. The left chart shows that HEC is more accurate than GPT-4 on six of the eight datasets. The right chart shows that HEC improves on WMV and Stacking on every dataset.
  • Figure 2: Number of base-learners included in the HEC ensemble for the different datasets.
  • Figure 3: Percentage inclusion of models by learning algorithm in the HEC ensemble.
  • Figure 4: Percentage inclusion of models by learning algorithm in the WMV ensemble.
  • Figure 5: Percentage share of each learning algorithm after summing the feature importance of its models.