PARIKSHA: A Large-Scale Investigation of Human-LLM Evaluator Agreement on Multilingual and Multi-Cultural Data

Ishaan Watts, Varun Gumma, Aditya Yadavalli, Vivek Seshadri, Manohar Swaminathan, Sunayana Sitaram

TL;DR

PARIKSHA addresses the challenge of evaluating multilingual LLMs on culturally diverse data by combining large-scale human judgments with LLM-based assessments in two settings: pairwise comparison and direct assessment. It introduces a culturally nuanced prompt set curated with native speakers across 10 Indic languages and evaluates 30 models, constructing leaderboards to analyze agreement and biases between human and LLM evaluators. The study finds that frontier models such as GPT-4o and Llama-3 70B perform best for most Indic languages, while human-LLM agreement is weaker under direct assessment than under pairwise comparison, particularly for languages such as Bengali and Odia; it also finds evidence of self-bias in the GPT-based evaluator. The work demonstrates the feasibility and value of scalable multilingual evaluation, while underscoring the need for hybrid human-in-the-loop approaches to capture cultural nuance and ensure reliable judgments.

Abstract

Evaluation of multilingual Large Language Models (LLMs) is challenging due to a variety of factors: the lack of benchmarks with sufficient linguistic diversity, contamination of popular benchmarks into LLM pre-training data, and the lack of local, cultural nuances in translated benchmarks. In this work, we study human and LLM-based evaluation in a multilingual, multi-cultural setting. We evaluate 30 models across 10 Indic languages by conducting 90K human evaluations and 30K LLM-based evaluations and find that models such as GPT-4o and Llama-3 70B consistently perform best for most Indic languages. We build leaderboards for two evaluation settings, pairwise comparison and direct assessment, and analyze the agreement between humans and LLMs. We find that humans and LLMs agree fairly well in the pairwise setting, but the agreement drops for direct assessment evaluation, especially for languages such as Bengali and Odia. We also check for various biases in human and LLM-based evaluation and find evidence of self-bias in the GPT-based evaluator. Our work presents a significant step towards scaling up multilingual evaluation of LLMs.
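
To make the notion of human-LLM agreement concrete, the following is a minimal sketch of a chance-corrected agreement score between two evaluators. It uses Cohen's kappa as an illustrative choice; the text here only refers to agreement and $\kappa$ scores, so the exact statistic, the function name cohen_kappa, and the sample verdicts are all assumptions made for illustration, not the authors' implementation or data.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters labelling the same items (illustrative helper)."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(labels_a)
    # Observed agreement: fraction of items on which both raters give the same label.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement, computed from each rater's marginal label frequencies.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    p_e = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both raters used one identical label for every item
    return (p_o - p_e) / (1 - p_e)


# Hypothetical verdicts on six pairwise comparisons ("A", "B", or "tie").
human = ["A", "A", "B", "tie", "B", "A"]
llm = ["A", "B", "B", "tie", "B", "A"]
kappa = cohen_kappa(human, llm)
print(round(kappa, 3))
```

A value near 1 indicates agreement well above chance, and a value near 0 indicates agreement no better than chance. With more than two annotators per item, a multi-rater statistic such as Fleiss' kappa would be the natural substitute.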

Paper Structure

This paper contains 74 sections, 3 equations, 20 figures, and 38 tables.

Figures (20)

  • Figure 1: Evaluation pipeline: (1) We curate a diverse set of evaluation prompts with the help of native speakers. (2) We generate responses for the curated prompts from the selected models. (3) We evaluate generated responses in two settings (direct assessment and pairwise comparison) by both humans and an LLM. (4) We construct leaderboards using scores obtained and analyze the agreement between human and LLM evaluators.
  • Figure 2: Comparison of Elo ratings of models across languages evaluated by both humans and an LLM. We group all models into three categories: Indic, Proprietary, and Open-Source base LLMs (see Appendix \ref{sec:model_details} for more details).
  • Figure 3: Comparison of average Direct Assessment scores across languages evaluated by both humans and an LLM. We group all models into three categories: Indic, Proprietary, and Open-Source base LLMs (see Appendix \ref{sec:model_details} for more details).
  • Figure 4: RTP-LX Safety Evaluation of Hindi models. We report the fraction of prompt completions judged problematic by GPT-4 Evaluator and the heuristic Toxicity-200 exact match.
  • Figure 5: Language-wise $\kappa$ scores breakdown for Pairwise and Direct Assessment evaluations.
  • ...and 15 more figures
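
Figure 2 reports Elo ratings derived from pairwise comparisons. As a rough sketch of how such ratings can be obtained from a list of pairwise verdicts, the code below applies the standard online Elo update. The K-factor of 32, the initial rating of 1000, the function names, and the model names are assumptions for illustration only; the paper may use different constants or a different estimation procedure.

```python
def expected_score(r_a, r_b):
    """Standard Elo expected score of a player rated r_a against one rated r_b."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))


def elo_ratings(battles, k=32.0, initial=1000.0):
    """Run sequential Elo updates over (model_a, model_b, outcome) tuples.

    outcome is "A" if model_a's response was preferred, "B" if model_b's
    was preferred, and "tie" otherwise. k and initial are assumed values.
    """
    outcome_to_score = {"A": 1.0, "B": 0.0, "tie": 0.5}
    ratings = {}
    for model_a, model_b, outcome in battles:
        r_a = ratings.setdefault(model_a, initial)
        r_b = ratings.setdefault(model_b, initial)
        s_a = outcome_to_score[outcome]  # actual score for model_a
        e_a = expected_score(r_a, r_b)   # expected score for model_a
        # Zero-sum update: whatever model_a gains, model_b loses.
        ratings[model_a] = r_a + k * (s_a - e_a)
        ratings[model_b] = r_b - k * (s_a - e_a)
    return ratings


# Hypothetical verdicts; the model names are placeholders.
battles = [
    ("model_x", "model_y", "A"),
    ("model_y", "model_z", "tie"),
    ("model_x", "model_z", "A"),
]
ratings = elo_ratings(battles)
leaderboard = sorted(ratings, key=ratings.get, reverse=True)
print(leaderboard)
```

Sequential Elo depends on the order in which comparisons are processed, so in practice the battles are often shuffled and the ratings averaged over several permutations, or the ratings are fitted jointly with a Bradley-Terry model.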