ProverbEval: Exploring LLM Evaluation Challenges for Low-resource Language Understanding

Israel Abebe Azime, Atnafu Lambebo Tonja, Tadesse Destaw Belay, Yonas Chanie, Bontu Fufa Balcha, Negasi Haile Abadi, Henok Biadglign Ademtew, Mulubrhan Abebe Nerea, Debela Desalegn Yadeta, Derartu Dagne Geremew, Assefa Atsbiha Tesfau, Philipp Slusallek, Thamar Solorio, Dietrich Klakow

TL;DR

ProverbEval introduces a culture-focused LLM evaluation benchmark for low-resource languages, using proverb-based tasks to dissect language and cultural understanding. The study systematically analyzes zero-shot performance, prompt language, and choice-order effects across multiple Ethiopian languages and English, revealing substantial sensitivity to input framing and tokenizer quality rather than model size alone. Key findings show monolingual prompts often outperform cross-lingual ones, translating proverbs to English yields limited gains, and generation tasks benefit from native-language descriptions, with Ge’ez exhibiting unique patterns. The work highlights the need for multilingual, culturally aware evaluation frameworks and provides data and code resources to drive robust development of low-resource language models.

Abstract

With the rapid development of evaluation datasets to assess LLMs' understanding across a wide range of subjects and domains, identifying a suitable language understanding benchmark has become increasingly challenging. In this work, we explore LLM evaluation challenges for low-resource language understanding and introduce ProverbEval, an LLM evaluation benchmark for low-resource languages that focuses on language understanding in culture-specific scenarios. We benchmark various LLMs and explore factors that create variability in the benchmarking process. We observed performance variances of up to 50%, depending on the order in which answer choices were presented in multiple-choice tasks. Native-language proverb descriptions significantly improve performance on tasks such as proverb generation. Additionally, monolingual evaluations consistently outperformed their cross-lingual counterparts in generation tasks. We argue that special attention must be given to the order of choices, the choice of prompt language, task variability, and generation tasks when creating LLM evaluation benchmarks. Evaluation data is available at https://huggingface.co/datasets/israel/ProverbEval and evaluation code at https://github.com/EthioNLP/EthioProverbEval.

Paper Structure

This paper contains 51 sections, 3 figures, 11 tables.

Figures (3)

  • Figure 1: Detailed overview of ProverbEval, which consists of three distinct tasks. Native languages are those listed in Table \ref{tab:data-dist}. Detailed prompt descriptions can be found in Appendix \ref{app:prompt-details}.
  • Figure 2: Subword fertility of proverbs for each model’s tokenizer in our study. Models that share the same tokenizer are grouped together. Lower values are better: they indicate that, on average, words are split into fewer subword pieces.
  • Figure 3: Average accuracy on the fill-the-blank task (0 and 5 shots). Zero-shot and five-shot results are each averaged over three random shuffles using an English prompt.
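
The subword fertility reported in Figure 2 can be illustrated with a minimal sketch. It assumes the common definition of fertility as the average number of subword tokens per whitespace-delimited word; the paper's exact word segmentation is not given here, so that choice is an assumption. The names `subword_fertility` and `toy_chunk_tokenizer` are hypothetical, the toy tokenizer is only a stand-in for a real model tokenizer, and the English proverbs are illustrative inputs rather than benchmark data.

```python
from typing import Callable, Iterable, List


def subword_fertility(texts: Iterable[str],
                      tokenize: Callable[[str], List[str]]) -> float:
    """Average number of subword tokens per whitespace-delimited word.

    Assumption: words are obtained by splitting on whitespace.
    """
    total_words = 0
    total_tokens = 0
    for text in texts:
        for word in text.split():
            total_words += 1
            total_tokens += len(tokenize(word))
    if total_words == 0:
        raise ValueError("no words to tokenize")
    return total_tokens / total_words


def toy_chunk_tokenizer(word: str, width: int = 3) -> List[str]:
    """Hypothetical stand-in tokenizer: fixed-width character chunks."""
    return [word[i:i + width] for i in range(0, len(word), width)]


if __name__ == "__main__":
    # Illustrative inputs only, not taken from the benchmark.
    proverbs = ["practice makes perfect", "look before you leap"]
    print(subword_fertility(proverbs, toy_chunk_tokenizer))
```

To compare real models, any callable that maps a word to its list of subword tokens can be passed in place of `toy_chunk_tokenizer`. A tokenizer that never splits words gives a fertility of exactly 1.0, which is the lower bound under this definition.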