AraSTEM: A Native Arabic Multiple Choice Question Benchmark for Evaluating LLMs Knowledge In STEM Subjects
Ahmad Mustapha, Hadi Al-Khansa, Hadi Al-Mubasher, Aya Mourad, Ranam Hamoud, Hasan El-Husseini, Marwah Al-Sakkaf, Mariette Awad
TL;DR
This work tackles the lack of native Arabic benchmarks for evaluating LLM knowledge in STEM by introducing AraSTEM, an 11,637-question MCQ dataset that spans core STEM domains across multiple educational levels and has traceable sources. It assesses several open-source LLMs in a zero-shot setting, employing English-body prompts and Chain-of-Thought prompting, and analyzes calibration and subject-wise performance to reveal substantial challenges and model complementarity. The findings show that even Arabic-exposed models struggle on AraSTEM, with performance tied to data exposure and instruction fine-tuning, underscoring the need for localized data and benchmark-driven development of Arabic AI. The dataset and analyses provide a foundation for evaluating and guiding the development of Arabic-language STEM knowledge in LLMs, with implications for model training, calibration, and explainability research.
Abstract
Large Language Models (LLMs) have shown remarkable capabilities, not only in generating human-like text but also in acquiring knowledge. This highlights the need to go beyond the typical Natural Language Processing downstream benchmarks and assess the various aspects of LLMs, including knowledge and reasoning. Numerous benchmarks have been developed to evaluate LLMs' knowledge, but they predominantly focus on the English language. Given that many LLMs are multilingual, relying solely on benchmarking English knowledge is insufficient. To address this issue, we introduce AraSTEM, a new Arabic multiple-choice question dataset aimed at evaluating LLMs' knowledge in STEM subjects. The dataset spans a range of topics at different levels, requiring models to demonstrate a deep understanding of scientific Arabic in order to achieve high accuracy. Our findings show that publicly available models of varying sizes struggle with this dataset, underscoring the need for more localized language models. The dataset is freely accessible on Hugging Face.
