
ChatGPT vs Gemini vs LLaMA on Multilingual Sentiment Analysis

Alessio Buscemi, Daniele Proverbio

TL;DR

This work investigates multilingual sentiment analysis using four leading LLMs (ChatGPT 3.5, ChatGPT 4, Gemini Pro, LLaMA2 7b) on 20 nuanced scenarios translated into 10 languages, validated against post-hoc human responses. It introduces a standardized, repeatable evaluation framework that includes explicit prompts, human benchmarks, and analysis of safety-filter behavior, revealing cross-language biases and varying handling of irony and sarcasm. Key findings show ChatGPT and Gemini generally align with human sentiment in ambiguous cases, but exhibit language- and version-dependent biases, while LLaMA2 tends to skew positive across languages. The study highlights the need for interpretable data, consistent multilingual performance, and careful consideration of safety filters in downstream sentiment applications.

Abstract

Automated sentiment analysis using Large Language Models (LLMs) such as ChatGPT, Gemini or LLaMA2 is becoming widespread, both in academic research and in industrial applications. However, assessment and validation of their performance on ambiguous or ironic text remain poor. In this study, we constructed nuanced and ambiguous scenarios, translated them into 10 languages, and predicted their associated sentiment using popular LLMs. The results are validated against post-hoc human responses. ChatGPT and Gemini often cope well with ambiguous scenarios, but we recognise significant biases and inconsistent performance across models and across the human languages evaluated. This work provides a standardised methodology for evaluating automated sentiment analysis and calls for action to improve the algorithms and their underlying data, and thereby their performance, interpretability and applicability.

Paper Structure (20 sections, 1 equation, 7 figures)

Figures (7)

  • Figure 1: The 20 scenarios translated into each of the 10 languages considered in this study.
  • Figure 2: Mean rates calculated by ChatGPT 3.5, ChatGPT 4, Gemini Pro and LLaMA2 7b, from all iterations and languages for each considered scenario.
  • Figure 3: Normalized mean rate $R_{\ell}$ scored by each LLM model, and for each language.
  • Figure 4: (a) Mean rates calculated from questionnaires submitted to human native speakers, for each considered scenario and averaged over all languages. (b) Difference between LLM mean rating and human mean rating.
  • Figure 5: The proportion of questionnaire responses, for each language. ZH = Mandarin Chinese, EN = English, FR = French, DE = German, IT = Italian, JA = Japanese, PL = Polish, PT = Portuguese, RU = Russian, ES = Spanish.
  • ...and 2 more figures
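
The per-scenario comparison described in the captions of Figures 2 and 4 can be illustrated with a minimal sketch. It assumes ratings are pooled into a plain mean per scenario over all languages and iterations, and that the difference in Figure 4(b) is taken as LLM mean minus human mean; the sign convention, the record layout, the function names and the numbers below are all hypothetical and not taken from the paper.

```python
from collections import defaultdict
from statistics import mean


def mean_by_scenario(records):
    """Average ratings per scenario, pooling all languages and iterations.

    `records` is an iterable of (scenario_id, language, rating) tuples.
    This layout is an assumption made for illustration only.
    """
    buckets = defaultdict(list)
    for scenario_id, _language, rating in records:
        buckets[scenario_id].append(rating)
    return {scenario: mean(ratings) for scenario, ratings in buckets.items()}


def rating_gap(llm_records, human_records):
    """Per-scenario difference: LLM mean rating minus human mean rating.

    Only scenarios present in both inputs are compared. The sign
    convention (LLM minus human) is an assumption.
    """
    llm = mean_by_scenario(llm_records)
    human = mean_by_scenario(human_records)
    return {s: llm[s] - human[s] for s in llm.keys() & human.keys()}


# Hypothetical ratings, not data from the study.
llm_records = [(1, "EN", 8), (1, "IT", 6), (2, "EN", 3), (2, "IT", 5)]
human_records = [(1, "EN", 6), (1, "IT", 6), (2, "EN", 4), (2, "IT", 4)]

gaps = rating_gap(llm_records, human_records)
for scenario in sorted(gaps):
    print(f"scenario {scenario}: gap = {gaps[scenario]}")
```

Under this convention, a positive gap means the model rated a scenario more positively than the human respondents did, which is the kind of skew the summary attributes to LLaMA2.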