AraSTEM: A Native Arabic Multiple Choice Question Benchmark for Evaluating LLMs Knowledge In STEM Subjects

Ahmad Mustapha, Hadi Al-Khansa, Hadi Al-Mubasher, Aya Mourad, Ranam Hamoud, Hasan El-Husseini, Marwah Al-Sakkaf, Mariette Awad

TL;DR

This work tackles the lack of native Arabic benchmarks for evaluating LLM knowledge in STEM by introducing AraSTEM, an 11,637-question MCQ dataset that spans core STEM domains across multiple educational levels and has traceable sources. It evaluates several open-source LLMs in a zero-shot setting, using English-body prompts and Chain-of-Thought prompting, and analyzes calibration and subject-wise performance, revealing substantial challenges as well as complementarity between models. The findings show that even Arabic-exposed models struggle on AraSTEM, with performance tied to data exposure and instruction fine-tuning, underscoring the need for localized data and benchmark-driven development of Arabic AI. The dataset and analyses provide a foundation for evaluating and guiding the development of Arabic-language STEM knowledge in LLMs, with implications for model training, calibration, and explainability research.

Abstract

Large Language Models (LLMs) have shown remarkable capabilities, not only in generating human-like text, but also in acquiring knowledge. This highlights the need to go beyond the typical Natural Language Processing downstream benchmarks and assess the various aspects of LLMs, including knowledge and reasoning. Numerous benchmarks have been developed to evaluate LLMs' knowledge, but they predominantly focus on the English language. Given that many LLMs are multilingual, relying solely on benchmarking English knowledge is insufficient. To address this issue, we introduce AraSTEM, a new Arabic multiple-choice question dataset aimed at evaluating LLMs' knowledge in STEM subjects. The dataset spans a range of topics at different levels, which requires models to demonstrate a deep understanding of scientific Arabic in order to achieve high accuracy. Our findings show that publicly available models of varying sizes struggle with this dataset, underscoring the need for more localized language models. The dataset is freely accessible on Hugging Face.
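
The TL;DR notes that the paper analyzes model calibration on the multiple-choice questions, but this page does not say which metric is used. As a minimal sketch, assuming expected calibration error (ECE) with equal-width confidence bins, the check could look like the following. The function name, the bin count, and the sample values are hypothetical and are not taken from the paper.

```python
from typing import Sequence


def expected_calibration_error(
    confidences: Sequence[float],
    correct: Sequence[bool],
    n_bins: int = 10,
) -> float:
    """ECE with equal-width bins (an assumed metric, not confirmed by the paper).

    confidences: the model's probability for its chosen option, in [0, 1].
    correct: whether the chosen option matched the answer key.
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    if any(c < 0.0 or c > 1.0 for c in confidences):
        raise ValueError("confidences must lie in [0, 1]")
    if not confidences:
        return 0.0

    # Group predictions by confidence; a confidence of 1.0 goes in the last bin.
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))

    total = len(confidences)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        avg_conf = sum(c for c, _ in bucket) / len(bucket)
        accuracy = sum(1 for _, ok in bucket if ok) / len(bucket)
        # Weight each bin's |accuracy - confidence| gap by its share of samples.
        ece += (len(bucket) / total) * abs(accuracy - avg_conf)
    return ece


# Illustrative values only: four answers with made-up confidences.
sample_ece = expected_calibration_error(
    [0.9, 0.9, 0.6, 0.6], [True, True, True, False]
)
```

A lower value means the model's stated confidence tracks its accuracy more closely. A model that is always right with confidence 1.0 scores 0.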

Paper Structure (16 sections, 12 figures, 5 tables)

Figures (12)

  • Figure 1: Semantic embedding of AraSTEM based on the E5 multilingual embedding model, projected using UMAP
  • Figure 2: A sample from AraSTEM questions corresponding to primary and secondary levels
  • Figure 3: A sample from AraSTEM questions featuring a college-level medicine question
  • Figure 4: The distribution of AraSTEM question word counts, presented per subject
  • Figure 5: A sample from AraSTEM questions featuring college-level questions in both chemistry and biology
  • ...and 7 more figures