An overview of model uncertainty and variability in LLM-based sentiment analysis. Challenges, mitigation strategies and the role of explainability
David Herrera-Poyatos, Carlos Peláez-González, Cristina Zuheros, Andrés Herrera-Poyatos, Virilo Tejedor, Francisco Herrera, Rosana Montes
TL;DR
The paper tackles the Model Variability Problem (MVP) in LLM-based sentiment analysis, showing that outputs can be unstable across repeated inferences and prompts due to factors like stochastic decoding and data biases. It offers a 12-factor taxonomy of MVP causes, supported by case studies on GPT-4o and Mixtral 8x22B that highlight temperature and prompt sensitivity as key variability amplifiers. The authors discuss explainability (XAI) as a central remedy, proposing strategies such as uncertainty quantification, ensemble consensus, domain-adaptive fine-tuning, bias mitigation, and hybrid lexicon-LLM approaches, alongside robust benchmarking and reproducibility for open-source LLMs. The work aims to guide the development of more reliable, explainable sentiment-analysis systems suitable for high-stakes domains by integrating stability-focused evaluation with practical mitigation techniques.
Abstract
Large Language Models (LLMs) have significantly advanced sentiment analysis, yet their inherent uncertainty and variability pose critical challenges to achieving reliable and consistent outcomes. This paper systematically explores the Model Variability Problem (MVP) in LLM-based sentiment analysis, characterized by inconsistent sentiment classification, polarization, and uncertainty arising from stochastic inference mechanisms, prompt sensitivity, and biases in training data. We analyze the core causes of MVP, presenting illustrative examples and a case study to highlight its impact. In addition, we investigate key challenges and mitigation strategies, paying particular attention to temperature as a driver of output randomness and emphasizing the importance of explainability for improving transparency and user trust. By providing a structured perspective on stability, reproducibility, and trustworthiness, this study supports the development of more reliable, explainable, and robust sentiment analysis models, facilitating their deployment in high-stakes domains such as finance, healthcare, and policymaking.
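To make the link between temperature and output variability concrete, the following is a minimal Python sketch, not the authors' experimental setup. It treats a single sentiment decision as one draw from a temperature-scaled softmax over three labels, which is a simplification of how an LLM decodes a full response. The logits, the label set, and the helper names (`softmax_with_temperature`, `sample_labels`, `agreement_rate`, `label_entropy`) are hypothetical and chosen only for illustration.

```python
import math
import random
from collections import Counter

# Hypothetical label set for a three-way sentiment task.
LABELS = ["negative", "neutral", "positive"]


def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities, dividing by a temperature > 0 first."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_labels(logits, temperature, n_runs, rng):
    """Simulate n_runs repeated inferences on the same input."""
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(LABELS, weights=probs, k=n_runs)


def agreement_rate(labels):
    """Fraction of runs that returned the most frequent label."""
    counts = Counter(labels)
    return counts.most_common(1)[0][1] / len(labels)


def label_entropy(labels):
    """Shannon entropy (in bits) of the empirical label distribution."""
    counts = Counter(labels)
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the simulation is repeatable
    # Assumed scores for an ambiguous review; not taken from any real model.
    logits = [0.2, 1.0, 1.4]
    for temperature in (0.2, 0.7, 1.5):
        runs = sample_labels(logits, temperature, n_runs=200, rng=rng)
        print(
            f"T={temperature}: agreement={agreement_rate(runs):.2f}, "
            f"entropy={label_entropy(runs):.2f} bits"
        )
```

In this toy setting, lower temperatures concentrate probability on the highest-scoring label, so repeated runs agree more often; higher temperatures flatten the distribution and spread the labels out. Agreement rate and label entropy are two simple stability measures one could report alongside accuracy when running the same prompt many times against a real model.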
