Sentiment Analysis through LLM Negotiations
Xiaofei Sun, Xiaoya Li, Shengyu Zhang, Shuhe Wang, Fei Wu, Jiwei Li, Tianwei Zhang, Guoyin Wang
TL;DR
This work addresses the limitations of single-LLM sentiment analysis under in-context learning by introducing a multi-LLM negotiation framework. A reasoning-infused generator and an explanation-deriving discriminator collaborate through iterative negotiation, with optional role flips and a third LLM voting mechanism to resolve disagreements. Empirical results on SST-2, MR, Twitter, Yelp, Amazon, and IMDB show consistent gains over vanilla ICL and competitive performance against supervised baselines, highlighting the value of collaborative reasoning for complex linguistic phenomena. The approach demonstrates that leveraging diverse LLM perspectives through consensus can improve accuracy and robustness in sentiment classification tasks.
Abstract
A standard paradigm for sentiment analysis is to rely on a single LLM that makes its decision in a single round under the framework of in-context learning. This framework suffers from a key disadvantage: the single-turn output generated by a single LLM might not deliver the correct decision, just as humans sometimes need multiple attempts to get things right. This is especially true for sentiment analysis, where deep reasoning is required to handle complex linguistic phenomena (e.g., clause composition, irony, etc.) in the input. To address this issue, this paper introduces a multi-LLM negotiation framework for sentiment analysis. The framework consists of a reasoning-infused generator that provides a decision along with a rationale, and an explanation-deriving discriminator that evaluates the credibility of the generator's output. The generator and the discriminator iterate until a consensus is reached. The proposed framework naturally addresses the aforementioned challenge, as it draws on the complementary abilities of two LLMs and has them use rationales to persuade each other toward a correction. Experiments on a wide range of sentiment analysis benchmarks (SST-2, Movie Review, Twitter, Yelp, Amazon, IMDB) demonstrate the effectiveness of the proposed approach: it consistently yields better performance than the ICL baseline across all benchmarks, and even outperforms supervised baselines on the Twitter and Movie Review datasets.
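The generator-discriminator iteration described in the abstract can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration and not the authors' implementation: the `negotiate` function, the `Generator` and `Discriminator` callable signatures, the `max_turns` cap, and the toy stand-ins that replace real prompted LLMs. The sketch covers only the basic loop; role flips and the third-LLM vote are not modeled, so a failed negotiation simply returns `None`.

```python
from typing import Callable, List, Optional, Tuple

# Hypothetical interfaces: each callable stands in for a prompted LLM.
# Generator: (text, negotiation history) -> (label, rationale)
Generator = Callable[[str, List[str]], Tuple[str, str]]
# Discriminator: (text, label, rationale) -> (agrees, explanation)
Discriminator = Callable[[str, str, str], Tuple[bool, str]]


def negotiate(
    text: str,
    generator: Generator,
    discriminator: Discriminator,
    max_turns: int = 3,  # assumed cap; the abstract only says "until a consensus is reached"
) -> Tuple[Optional[str], List[str]]:
    """Alternate generator and discriminator turns until they agree.

    Returns the agreed label (or None if no consensus within max_turns)
    together with the full negotiation history.
    """
    history: List[str] = []
    for _ in range(max_turns):
        label, rationale = generator(text, history)
        history.append(f"generator: {label} because {rationale}")

        agrees, explanation = discriminator(text, label, rationale)
        verdict = "agree" if agrees else "disagree"
        history.append(f"discriminator: {verdict} because {explanation}")

        if agrees:
            return label, history
    return None, history


# Toy stand-ins for LLM calls, used only to exercise the loop.
def toy_generator(text: str, history: List[str]) -> Tuple[str, str]:
    # Revise the initial guess once the discriminator has objected.
    if any("disagree" in turn for turn in history):
        return "negative", "the negation reverses the polarity of 'good'"
    return "positive", "the word 'good' appears"


def toy_discriminator(text: str, label: str, rationale: str) -> Tuple[bool, str]:
    # Accept only labels consistent with a simple negation check.
    expected = "negative" if "not" in text.split() else "positive"
    if label == expected:
        return True, "the rationale accounts for the whole sentence"
    return False, "the rationale ignores part of the sentence"


if __name__ == "__main__":
    final_label, turns = negotiate("not good", toy_generator, toy_discriminator)
    print(final_label)
    for turn in turns:
        print(turn)
```

In this toy run the generator's first answer is rejected, the objection is carried forward in the history, and the second answer is accepted. In a real setting both callables would wrap LLM prompts that include the prior turns as context.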
