Khayyam Challenge (PersianMMLU): Is Your LLM Truly Wise to The Persian Language?

Omid Ghahroodi, Marzia Nouri, Mohammad Vali Sanian, Alireza Sahebi, Doratossadat Dastgheib, Ehsaneddin Asgari, Mahdieh Soleymani Baghshah, Mohammad Hossein Rohban

TL;DR

Khayyam Challenge (PersianMMLU) provides the first native Persian, metadata-rich benchmark for evaluating LLMs across 38 subjects with 20,192 MCQs, designed to avoid translation artifacts and to capture question difficulty, educational stage, descriptive reasoning, and trap questions. The authors evaluate nine LLMs, comparing extraction methods and examining translation quality, few-shot limitations, and chain-of-thought effects, with GPT-4 achieving the strongest performance yet still falling short of human-level reasoning by about 35%. Key findings include GPT-4’s relative strength across domains and its robustness to traps, the importance of original Persian content, and the significant performance gap in mathematics and other reasoning-heavy areas. The work establishes Khayyam as a scalable, domain-rich framework that enables deeper analysis of Persian-language AI capabilities and informs future model development and evaluation in low-resource languages.

Abstract

Evaluating Large Language Models (LLMs) is challenging due to their generative nature, necessitating precise evaluation methodologies. Additionally, non-English LLM evaluation lags behind English, resulting in the absence or weakness of LLMs for many languages. In response to this necessity, we introduce Khayyam Challenge (also known as PersianMMLU), a meticulously curated collection comprising 20,192 four-choice questions sourced from 38 diverse tasks extracted from Persian examinations, spanning a wide spectrum of subjects, complexities, and ages. The primary objective of the Khayyam Challenge is to facilitate the rigorous evaluation of LLMs that support the Persian language. Distinctive features of the Khayyam Challenge are (i) its comprehensive coverage of various topics, including literary comprehension, mathematics, sciences, logic, intelligence testing, etc., aimed at assessing different facets of LLMs such as language comprehension, reasoning, and information retrieval across various educational stages, from lower primary school to upper secondary school; (ii) its inclusion of rich metadata such as human response rates, difficulty levels, and descriptive answers; (iii) its utilization of new data to avoid data contamination issues prevalent in existing frameworks; (iv) its use of original, non-translated data tailored for Persian speakers, ensuring the framework is free from translation challenges and errors while encompassing cultural nuances; and (v) its inherent scalability for future data updates and evaluations without requiring special human effort. Previous works lacked an evaluation framework that combined all of these features into a single comprehensive benchmark. Furthermore, we evaluate a wide range of existing LLMs that support the Persian language, with statistical analyses and interpretations of their outputs.
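
As a rough illustration of how four-choice questions with metadata of the kind described above might be scored, the following Python sketch groups accuracy by category. The `Question` class, its field names, and the toy items are hypothetical and are not the dataset's actual schema or content; this is a minimal sketch under those assumptions, not the authors' evaluation code.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Question:
    # Hypothetical record layout; the real dataset's field names may differ.
    text: str
    choices: tuple          # the four answer options
    answer: int             # index (0-3) of the correct option
    category: str           # e.g. a main subject category
    difficulty: str         # e.g. "easy", "medium", "hard"
    human_accuracy: float   # fraction of students who answered correctly


def accuracy_by_category(questions, predictions):
    """Return {category: fraction of questions answered correctly}."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for question, predicted in zip(questions, predictions):
        total[question.category] += 1
        if predicted == question.answer:
            correct[question.category] += 1
    return {category: correct[category] / total[category] for category in total}


# Toy placeholder items, not real benchmark questions.
questions = [
    Question("q1", ("a", "b", "c", "d"), 0, "mathematics", "easy", 0.8),
    Question("q2", ("a", "b", "c", "d"), 2, "mathematics", "hard", 0.3),
    Question("q3", ("a", "b", "c", "d"), 1, "literature", "medium", 0.6),
]
predictions = [0, 1, 1]  # hypothetical model outputs as choice indices

scores = accuracy_by_category(questions, predictions)
```

Keeping the human response rate on each record makes it straightforward to compare a model's per-category accuracy against the corresponding human figures.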

Paper Structure (31 sections, 1 equation, 32 figures, 16 tables)

Figures (32)

  • Figure 1: Question distribution with respect to categories, level of difficulty, and educational stages. LPS: Lower Primary School, UPS: Upper Primary School, LSS: Lower Secondary School, USS: Upper Secondary School
  • Figure 2: Comparison of accuracy across main categories for humans and various models
  • Figure 3: Distribution of questions across publication year and educational stage
  • Figure 4: Question distribution across all tasks by difficulty level
  • Figure 5: Sample prompt-0 with English translation for enhanced readability
  • ...and 27 more figures