PARIKSHA: A Large-Scale Investigation of Human-LLM Evaluator Agreement on Multilingual and Multi-Cultural Data
Ishaan Watts, Varun Gumma, Aditya Yadavalli, Vivek Seshadri, Manohar Swaminathan, Sunayana Sitaram
TL;DR
Pariksha addresses the challenge of evaluating multilingual LLMs on culturally diverse data by combining large-scale human judgments with LLM-based assessments in two settings: pairwise comparison and direct assessment. It introduces a culturally nuanced, native-speaker prompt set across 10 Indic languages and evaluates 30 models, constructing leaderboards to analyze agreement and biases between human and LLM evaluators. The study finds that frontier models such as GPT-4o and Llama-3 70B perform best for most Indic languages, while human-LLM agreement is weaker in the direct-assessment setting than in pairwise comparison, particularly for languages such as Bengali and Odia. It also finds evidence of biases, including self-bias in the GPT-based evaluator. The work demonstrates the feasibility and value of scalable multilingual evaluation, while underscoring the need for hybrid human-in-the-loop approaches to capture cultural nuance and ensure reliable judgments.
Abstract
Evaluation of multilingual Large Language Models (LLMs) is challenging due to a variety of factors: the lack of benchmarks with sufficient linguistic diversity, contamination of popular benchmarks into LLM pre-training data, and the lack of local, cultural nuances in translated benchmarks. In this work, we study human and LLM-based evaluation in a multilingual, multi-cultural setting. We evaluate 30 models across 10 Indic languages by conducting 90K human evaluations and 30K LLM-based evaluations and find that models such as GPT-4o and Llama-3 70B consistently perform best for most Indic languages. We build leaderboards for two evaluation settings, pairwise comparison and direct assessment, and analyze the agreement between humans and LLMs. We find that humans and LLMs agree fairly well in the pairwise setting, but the agreement drops for direct assessment, especially for languages such as Bengali and Odia. We also check for various biases in human and LLM-based evaluation and find evidence of self-bias in the GPT-based evaluator. Our work presents a significant step towards scaling up multilingual evaluation of LLMs.
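
To make the notion of human-LLM agreement in the pairwise setting concrete, the sketch below computes a chance-corrected agreement score between two lists of verdicts. It is only an illustration: the abstract does not say which agreement statistic the authors use, so Cohen's kappa is an assumption here, and the function name `cohen_kappa`, the verdict labels ("A", "B", "tie"), and the sample data are all hypothetical.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa between two equally long lists of categorical labels.

    Assumption: this statistic is chosen for illustration only; the
    abstract does not name the agreement measure used in the paper.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")

    n = len(labels_a)
    # Observed agreement: fraction of items where both raters match.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Expected agreement under independence, from each rater's label frequencies.
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    expected = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)

    if expected == 1.0:
        # Both raters always use the same single label: treat as perfect agreement.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical pairwise verdicts for six prompts: which response was preferred.
human = ["A", "B", "tie", "A", "B", "A"]
llm = ["A", "B", "A", "A", "tie", "A"]

kappa = cohen_kappa(human, llm)
print(f"kappa = {kappa:.3f}")
```

With these made-up verdicts, the two raters match on four of six items, and the kappa value discounts the share of that agreement expected by chance. The same function could be applied per language to compare agreement across, say, Bengali and Odia, given real annotations.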
