Beyond MCQ: An Open-Ended Arabic Cultural QA Benchmark with Dialect Variants

Hunzalah Hassan Bhatti, Firoj Alam

TL;DR

This work addresses the evaluation of LLMs on culturally grounded content across Arabic dialects by building a multilingual, dialect-rich QA benchmark. Its pipeline (i) translates Modern Standard Arabic (MSA) multiple-choice questions (MCQs) from PalmX-GC into English and four Arabic dialects, (ii) converts them into open-ended questions (OEQs), (iii) benchmarks zero-shot and fine-tuned LLMs on both MCQ and OEQ formats, and (iv) generates chain-of-thought (CoT) rationales for CoT-based fine-tuning. The result is an open corpus aligned in parallel across the dialects and English. The experiments show that models underperform on dialects, that Arabic-centric models are strong on MCQs but not on OEQs, and that CoT affects semantic and lexical metrics differently; GPT-5 and GPT-4.1 achieve strong performance. The dataset will be publicly released to support linguistically inclusive evaluation of culturally grounded QA across dialectal Arabic.

Abstract

Large Language Models (LLMs) are increasingly used to answer everyday questions, yet their performance on culturally grounded and dialectal content remains uneven across languages. We propose a comprehensive method that (i) translates Modern Standard Arabic (MSA) multiple-choice questions (MCQs) into English and several Arabic dialects, (ii) converts them into open-ended questions (OEQs), (iii) benchmarks a range of zero-shot and fine-tuned LLMs under both MCQ and OEQ settings, and (iv) generates chain-of-thought (CoT) rationales to fine-tune models for step-by-step reasoning. Using this method, we extend an existing dataset so that its QA pairs are aligned in parallel across multiple language varieties, making it, to our knowledge, the first of its kind. We conduct extensive experiments with both open and closed models. Our findings show that (i) models underperform on Arabic dialects, revealing persistent gaps in culturally grounded and dialect-specific knowledge; (ii) Arabic-centric models perform well on MCQs but struggle with OEQs; and (iii) CoT improves judged correctness while yielding mixed results on n-gram-based metrics. The developed dataset will be publicly released to support further research on culturally and linguistically inclusive evaluation.
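
To make the MCQ and OEQ formats concrete, the following is a minimal sketch, not the authors' implementation. It assumes the simplest possible conversion, in which the options are dropped and the gold option text becomes the reference answer; the source does not say how the conversion is actually done. The class names, the `unigram_f1` helper, and the toy question are all hypothetical and are not taken from the dataset. The helper stands in for a generic n-gram-style metric and shows how a correct but longer answer can score low on lexical overlap.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class MCQ:
    question: str
    options: list[str]
    answer_index: int  # position of the gold option in `options`


@dataclass
class OEQ:
    question: str
    reference: str  # gold answer as free text


def mcq_to_oeq(mcq: MCQ) -> OEQ:
    """Assumed conversion: drop the options, keep the gold option as reference."""
    return OEQ(question=mcq.question, reference=mcq.options[mcq.answer_index])


def unigram_f1(prediction: str, reference: str) -> float:
    """Token-level F1 between a prediction and a reference (whitespace tokens)."""
    pred = prediction.lower().split()
    ref = reference.lower().split()
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


# Toy item, invented for illustration only.
mcq = MCQ(
    question="What is the capital of Egypt?",
    options=["Cairo", "Alexandria", "Giza", "Luxor"],
    answer_index=0,
)
oeq = mcq_to_oeq(mcq)

short_score = unigram_f1("Cairo", oeq.reference)
long_score = unigram_f1("The capital of Egypt is Cairo", oeq.reference)
print(oeq.reference, short_score, round(long_score, 3))
```

Both predictions are correct, yet the longer one scores far lower on token overlap. This is one way in which judged correctness and n-gram-based metrics can diverge; whether it explains the mixed results reported here is not established by the source.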

Paper Structure

This paper contains 23 sections, 2 figures, 4 tables.

Figures (2)

  • Figure 1: Example QA shown in two formats (MCQ and OEQ). MCQ: Multiple-Choice Question; OEQ: Open-Ended Question. Flags in parentheses indicate representative countries where each dialect is widely spoken.
  • Figure 2: Pipeline for the dataset construction process.