Generating Effective Ensembles for Sentiment Analysis
Itay Etelis, Avi Rosenfeld, Abraham Itzhak Weinberg, David Sarne
TL;DR
The paper investigates how to push sentiment analysis (SA) performance beyond transformer-only ensembles by incorporating heterogeneous base-learners from the lexicon-based, bag-of-words, CNN, and transformer families. It introduces the Hierarchical Ensemble Construction (HEC) algorithm, a greedy, simulated-annealing-based method that builds small, complementary subsets of base-learners (3–6 models) from a large pool and aggregates their predictions with weighted voting. Across eight canonical SA datasets, HEC consistently outperforms both traditional ensemble methods (WMV, Stacking, Shapley, Bayesian Networks) and transformer-only ensembles, achieving a mean accuracy of $95.71\%$ and closing more of the gap to perfect accuracy than the other approaches. HEC also achieves higher average accuracy than GPT-4, although GPT-4 outperforms it on some datasets. These results underscore the practical value of carefully constructed heterogeneous ensembles for robust SA and suggest broader applicability to other NLP tasks.
Abstract
In recent years, transformer models have revolutionized Natural Language Processing (NLP), achieving exceptional results across various tasks, including Sentiment Analysis (SA). As such, current state-of-the-art approaches for SA predominantly rely on transformer models alone, achieving impressive accuracy levels on benchmark datasets. In this paper, we show that the key to further improving the accuracy of ensembles for SA is to include not only transformers but also traditional NLP models, despite the inferiority of the latter compared to transformer models. However, as we empirically show, this necessitates a change in how the ensemble is constructed, specifically relying on the Hierarchical Ensemble Construction (HEC) algorithm we present. Our empirical studies across eight canonical SA datasets reveal that ensembles incorporating a mix of model types, structured via HEC, significantly outperform traditional ensembles. Finally, we provide a comparative analysis of the performance of HEC and GPT-4, demonstrating that while GPT-4 closely approaches state-of-the-art SA methods, it remains outperformed by our proposed ensemble strategy.
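
To make the central claim concrete, the following minimal sketch shows how weighted voting can let weaker, non-transformer models correct a stronger model's errors when their mistakes do not overlap. It is not the HEC algorithm: it only illustrates the weighted-voting aggregation step on toy data. The model names, predictions, labels, and weights are all hypothetical and chosen purely for illustration.

```python
# Illustrative sketch only: toy data and hypothetical weights, not results from the paper.

def weighted_vote(votes, weights):
    """Return the label with the largest total weight among the given votes."""
    totals = {}
    for label, weight in zip(votes, weights):
        totals[label] = totals.get(label, 0.0) + weight
    return max(totals, key=totals.get)


def ensemble_accuracy(members, preds, weights, labels):
    """Accuracy of a weighted-voting ensemble built from the named members."""
    correct = 0
    for i, truth in enumerate(labels):
        votes = [preds[m][i] for m in members]
        member_weights = [weights[m] for m in members]
        if weighted_vote(votes, member_weights) == truth:
            correct += 1
    return correct / len(labels)


# Hypothetical binary sentiment labels (1 = positive, 0 = negative).
labels = [1, 0, 1, 1, 0, 0]

# Hypothetical per-model predictions; each model errs on different examples.
preds = {
    "transformer": [1, 0, 1, 1, 0, 1],  # wrong on the last example
    "lexicon":     [1, 1, 0, 1, 0, 0],  # wrong on the 2nd and 3rd examples
    "bow":         [0, 0, 1, 0, 0, 0],  # wrong on the 1st and 4th examples
}

# Hypothetical voting weights (e.g. derived from validation performance).
weights = {"transformer": 0.5, "lexicon": 0.3, "bow": 0.3}

solo = ensemble_accuracy(["transformer"], preds, weights, labels)
mixed = ensemble_accuracy(["transformer", "lexicon", "bow"], preds, weights, labels)
print(f"transformer only: {solo:.3f}")
print(f"heterogeneous ensemble: {mixed:.3f}")
```

On this toy data the transformer alone classifies 5 of 6 examples correctly, while the three-member ensemble classifies all 6 correctly, because the two weaker models jointly outvote the transformer on the one example it gets wrong. Note that adding either weak model alone changes nothing here, since the transformer's weight dominates any single partner. The benefit therefore depends on which members are combined, and choosing that combination is the subset-construction problem HEC addresses.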
