ArabicNumBench: Evaluating Arabic Number Reading in Large Language Models

Anas Alhumud, Abdulaziz Alhammadi, Muhammad Badruddin Khan

TL;DR

ArabicNumBench presents a comprehensive evaluation of Arabic number reading by 71 LLMs across dual numeral systems and six contextual categories, using four prompting strategies and two novel metrics for extraction and format preservation. The study demonstrates that few-shot Chain-of-Thought markedly improves accuracy and, more importantly, structured output generation, revealing a dissociation between numerical correctness and instruction-following. A small set of elite models achieve both high accuracy and robust structured outputs, while many top performers still rely on fallback extraction, underscoring production-readiness concerns. The findings urge practitioners to prioritize structured-output performance alongside numeric accuracy and to employ few-shot CoT prompting for production Arabic NLP systems.

Abstract

We present ArabicNumBench, a comprehensive benchmark for evaluating large language models on Arabic number reading tasks across Eastern Arabic-Indic numerals (0-9 in Arabic script) and Western Arabic numerals (0-9). We evaluate 71 models from 10 providers using four prompting strategies (zero-shot, zero-shot CoT, few-shot, few-shot CoT) on 210 number reading tasks spanning six contextual categories: pure numerals, addresses, dates, quantities, and prices. Our evaluation comprises 59,010 individual test cases and tracks extraction methods to measure structured output generation. Evaluation reveals substantial performance variation, with accuracy ranging from 14.29% to 99.05% across models and strategies. Few-shot Chain-of-Thought prompting achieves 2.8x higher accuracy than zero-shot approaches (80.06% vs 28.76%). A striking finding emerges: models achieving elite accuracy (98-99%) often produce predominantly unstructured output, with most responses lacking Arabic CoT markers. Only 6 models consistently generate structured output across all test cases, while the majority require fallback extraction methods despite high numerical accuracy. Comprehensive evaluation of 281 model-strategy combinations demonstrates that numerical accuracy and instruction-following represent distinct capabilities, establishing baselines for Arabic number comprehension and providing actionable guidance for model selection in production Arabic NLP systems.
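
The dual numeral systems and the headline figures in the abstract can be illustrated with a minimal Python sketch. It is not the benchmark's own code: the names `normalize_digits` and `accuracy` are hypothetical, and exact string match after digit normalization is an assumed scoring rule. The last lines only check that the reported totals are arithmetically consistent.

```python
# Hypothetical helpers for illustration; not taken from the ArabicNumBench code.

# Eastern Arabic-Indic digits (U+0660..U+0669) mapped to Western digits 0-9.
EASTERN_TO_WESTERN = str.maketrans("٠١٢٣٤٥٦٧٨٩", "0123456789")


def normalize_digits(text: str) -> str:
    """Rewrite Eastern Arabic-Indic digits as Western digits; leave other characters alone."""
    return text.translate(EASTERN_TO_WESTERN)


def accuracy(predictions: list[str], references: list[str]) -> float:
    """Percentage of predictions equal to their reference after digit normalization.

    Exact match is an assumption made here, not a rule stated in the source.
    """
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        raise ValueError("at least one reference is required")
    correct = sum(
        normalize_digits(p).strip() == normalize_digits(r).strip()
        for p, r in zip(predictions, references)
    )
    return 100.0 * correct / len(references)


# Consistency checks on figures quoted in the abstract.
total_cases = 281 * 210            # model-strategy combinations x tasks
cot_gain = round(80.06 / 28.76, 1)  # few-shot CoT vs zero-shot accuracy ratio
```

For example, `normalize_digits("١٢٣")` yields `"123"`, `total_cases` equals the reported 59,010, and `cot_gain` matches the reported 2.8x.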

Paper Structure (22 sections, 1 equation, 3 tables)