ERUPD -- English to Roman Urdu Parallel Dataset
Mohammed Furqan, Raahid Bin Khaja, Rayyan Habeeb
TL;DR
ERUPD is an English–Roman Urdu parallel corpus of 75,146 sentence pairs, built to address the shortage of Roman Urdu resources in NLP. Its hybrid data pipeline combines synthetic data generated via prompt engineering with real-world WhatsApp conversations, followed by human refinement that targets code-switching, phonetic variability, and synonym diversity. Evaluations with two transformer models (T5-Small and mBART) yield BLEU ≈ 39 and METEOR ≈ 0.526–0.531, which supports ERUPD's utility for machine translation in this low-resource language pair; the authors also position it as a resource for sentiment analysis and multilingual education. The dataset enables further study of Roman Urdu phenomena and provides a basis for standardization efforts and future expansions focused on diversity.
Abstract
Bridging linguistic gaps fosters global growth and cultural exchange. This study addresses the challenges of Roman Urdu -- a Latin-script adaptation of Urdu widely used in digital communication -- by creating a novel parallel dataset of 75,146 sentence pairs. Roman Urdu's lack of standardization, phonetic variability, and code-switching with English complicate language processing. We address these challenges with a hybrid approach that combines synthetic data generated via advanced prompt engineering with real-world conversational data from personal messaging groups. We then refine the dataset through a human evaluation phase that resolves linguistic inconsistencies and checks accuracy in code-switching, phonetic representations, and synonym variability. The resulting dataset captures Roman Urdu's diverse linguistic features and serves as a critical resource for machine translation, sentiment analysis, and multilingual education.
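For readers unfamiliar with the BLEU metric quoted in the TL;DR, the following is a minimal sketch of a sentence-level BLEU score (clipped n-gram precision up to 4-grams with a brevity penalty, no smoothing). It is not the authors' evaluation script, which is not described here; the function names and the Roman Urdu example sentences are hypothetical illustrations and are not taken from the dataset.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count all n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def sentence_bleu(reference, hypothesis, max_n=4):
    """Simplified sentence-level BLEU on a 0-100 scale (illustrative only)."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not hyp:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        hyp_counts = ngrams(hyp, n)
        ref_counts = ngrams(ref, n)
        total = sum(hyp_counts.values())
        # Clip each hypothesis n-gram count by its count in the reference.
        clipped = sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
        if total == 0 or clipped == 0:
            return 0.0  # no smoothing: any empty precision zeroes the score
        log_precisions.append(math.log(clipped / total))
    # Brevity penalty discourages hypotheses shorter than the reference.
    bp = 1.0 if len(hyp) > len(ref) else math.exp(1 - len(ref) / len(hyp))
    return 100 * bp * math.exp(sum(log_precisions) / max_n)


# Hypothetical example: "hoon" vs. "hun" are spelling variants of the same word.
reference = "main kal bazaar ja raha hoon"
variant = "main kal bazaar ja raha hun"
print(sentence_bleu(reference, reference))
print(sentence_bleu(reference, variant))
```

The second call scores lower than the first even though both sentences mean the same thing, which illustrates why phonetic spelling variability in Roman Urdu makes exact-match metrics such as BLEU harder to interpret.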
