MuDRiC: Multi-Dialect Reasoning for Arabic Commonsense Validation

Kareem Elozeiri, Mervat Abassy, Preslav Nakov, Yuxia Wang

TL;DR

The paper addresses the lack of Arabic dialect coverage in commonsense validation and introduces MuDRiC, the first multi-dialect benchmark for the task, spanning Egyptian, Gulf, Levantine, and Moroccan dialects. It also proposes a graph-based augmentation that fuses the output of a GCN over a word co-occurrence graph with MLM embeddings to better model cross-dialect semantics. Experiments show that GCN-enhanced models outperform plain fine-tuning across MSA and the four dialects, whereas domain-adversarial training intended to enforce dialect invariance degrades performance, which highlights the value of dialect-aware representations. The work contributes both a dataset and a method for dialect-robust Arabic NLP, with potential impact on fact-checking, misinformation detection, and safer dialogue systems.

Abstract

Commonsense validation evaluates whether a sentence aligns with everyday human understanding, a critical capability for developing robust natural language understanding systems. While substantial progress has been made in English, the task remains underexplored in Arabic, particularly given its rich linguistic diversity. Existing Arabic resources have primarily focused on Modern Standard Arabic (MSA), leaving regional dialects underrepresented despite their prevalence in spoken contexts. To bridge this gap, we present two key contributions. We introduce MuDRiC, an extended Arabic commonsense dataset incorporating multiple dialects. To the best of our knowledge, this is the first Arabic multi-dialect commonsense reasoning dataset. We further propose a novel method adapting Graph Convolutional Networks (GCNs) to Arabic commonsense reasoning, which enhances semantic relationship modeling for improved commonsense validation. Our experimental results demonstrate that this approach consistently outperforms the baseline of direct language model fine-tuning. Overall, our work enhances Arabic natural language understanding by providing a foundational dataset and a new method for handling its complex variations. Data and code are available at https://github.com/KareemElozeiri/MuDRiC.
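
To make the graph-based idea more concrete, the following is a minimal sketch of a word co-occurrence graph, one GCN propagation step, and a fusion with a sentence embedding. It is an illustration under stated assumptions, not the authors' implementation: the sliding-window size, the one-hot node features, the symmetric normalization (the standard GCN formulation), the mean pooling, and fusion by concatenation are all choices made here for the example, and the function names `cooccurrence_graph`, `gcn_layer`, and `fuse` are hypothetical. The sentence vector stands in for an MLM embedding and is filled with placeholder values.

```python
import math
from itertools import combinations


def cooccurrence_graph(tokens, window=3):
    """Build a symmetric co-occurrence count matrix over a sliding window (assumed size)."""
    vocab = sorted(set(tokens))
    index = {w: i for i, w in enumerate(vocab)}
    n = len(vocab)
    adj = [[0.0] * n for _ in range(n)]
    for start in range(max(1, len(tokens) - window + 1)):
        span = tokens[start:start + window]
        for a, b in combinations(span, 2):
            if a != b:  # skip self-pairs; self-loops are added in the GCN step
                adj[index[a]][index[b]] += 1.0
                adj[index[b]][index[a]] += 1.0
    return vocab, adj


def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def gcn_layer(adj, feats, weight):
    """One propagation step: ReLU(D^-1/2 (A + I) D^-1/2 X W)."""
    n = len(adj)
    a_hat = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    deg = [sum(row) for row in a_hat]  # always > 0 because of the self-loop
    norm = [[a_hat[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)] for i in range(n)]
    out = matmul(matmul(norm, feats), weight)
    return [[max(0.0, v) for v in row] for row in out]


def fuse(sentence_vec, node_feats):
    """Mean-pool node features and concatenate with the sentence vector (assumed fusion)."""
    dim = len(node_feats[0])
    pooled = [sum(row[d] for row in node_feats) / len(node_feats) for d in range(dim)]
    return list(sentence_vec) + pooled


# Toy input; English tokens are used only for readability.
tokens = ["the", "sun", "rises", "in", "the", "east"]
vocab, adj = cooccurrence_graph(tokens, window=3)

# One-hot node features and a small fixed weight matrix (both placeholders).
feats = [[1.0 if i == j else 0.0 for j in range(len(vocab))] for i in range(len(vocab))]
weight = [[0.1 * (i + 1), -0.05 * (i + 1)] for i in range(len(vocab))]
hidden = gcn_layer(adj, feats, weight)

# Placeholder for an MLM sentence embedding (e.g. a [CLS] vector).
sentence_vec = [0.2, -0.1, 0.4, 0.0]
fused = fuse(sentence_vec, hidden)
```

In a full model the fused vector would feed a classification head that labels the sentence as sensible or not; the weight matrix would be learned rather than fixed.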

Paper Structure

This paper contains 41 sections, 1 equation, 2 figures, 4 tables, and 1 algorithm.

Figures (2)

  • Figure 1: The creation process of the graph representation for a sentence.
  • Figure 2: BERT Model with Graph Embeddings Fusion.