A Simple Ensemble Strategy for LLM Inference: Towards More Stable Text Classification
Junichiro Niimi
TL;DR
This work tackles the variability and reproducibility challenges in LLM-based sentiment annotation by proposing a simple ensemble that repeats inference across multiple seed-driven workers on medium-sized LLMs and aggregates the outputs via the median. The method, implemented with non-finetuned Llama-family models and one-shot prompts, reduces RMSE by 18.6% relative to a single 70B model while preserving accuracy and reducing processing time. Key contributions include a practical, training-free ensemble approach that mimics the majority voting used to reach consensus among human annotators, and a demonstration on the Yelp sentiment task showing substantial efficiency gains. The results suggest practical value for business analytics and broader text classification tasks, with future directions including prompt variation and weighted ensemble schemes across domains.
Abstract
With the advance of large language models (LLMs), they have been applied to a wide variety of tasks. However, the variability and reproducibility of results across individual LLM trials have been largely overlooked in the existing literature, whereas human annotation commonly uses majority voting to resolve disagreements among annotators. This study therefore introduces a straightforward ensemble strategy for sentiment analysis using LLMs. The results demonstrate that an ensemble of multiple inferences using medium-sized LLMs produces more robust and accurate results than a single attempt with a large model, reducing RMSE by 18.6%.
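The aggregation idea described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: `run_worker` is a hypothetical stand-in that replaces a real LLM call with seeded random noise around a known rating, the 1 to 5 rating scale is an assumption, and the function names are chosen only for illustration. The sketch shows the two pieces the abstract relies on, a per-review median over several seeded workers and the RMSE used to score predictions.

```python
import math
import random
import statistics


def run_worker(true_ratings, seed):
    """Hypothetical stand-in for one seeded LLM worker.

    A real worker would prompt a model; here each rating is perturbed by
    seeded noise and clamped to an assumed 1..5 scale.
    """
    rng = random.Random(seed)
    return [min(5, max(1, t + rng.choice((-1, 0, 0, 1)))) for t in true_ratings]


def median_ensemble(per_worker_preds):
    """Aggregate predictions review by review using the median.

    per_worker_preds holds one list of ratings per worker, all the same length.
    With an odd number of workers the median is always one of the submitted
    ratings; with an even number it may be the mean of the two middle values.
    """
    return [statistics.median(votes) for votes in zip(*per_worker_preds)]


def rmse(preds, targets):
    """Root mean squared error between predicted and reference ratings."""
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(preds, targets)) / len(targets))


if __name__ == "__main__":
    truth = [5, 1, 3, 4, 2]          # toy reference ratings, not data from the paper
    seeds = [0, 1, 2, 3, 4]          # one seed per worker, odd count
    all_preds = [run_worker(truth, s) for s in seeds]
    ensemble = median_ensemble(all_preds)
    print("single worker RMSE:", rmse(all_preds[0], truth))
    print("median ensemble RMSE:", rmse(ensemble, truth))
```

Fixing the seeds makes each worker's output repeatable, which is what allows the whole ensemble run to be reproduced. The toy noise model says nothing about how much improvement to expect from real models.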
