MASRAD: Arabic Terminology Management Corpora with Semi-Automatic Construction

Mahdi Nasser, Laura Sayyah, Fadi A. Zaraket

TL;DR

This work tackles Arabic terminology management by introducing MASRAD, a large parallel term dataset, and MASRAD-Ex, a semi-automatic pipeline for extracting and curating foreign–Arabic term pairs from Arabic books. The method combines semantic, lexical, transliteration, entity, POS, and phonetic features, and uses heuristics and Auto-WEKA-based machine learning to rank candidate translations, achieving strong precision and recall in cross-book evaluation (e.g., $P\approx 90.5\%$, $R\approx 92.4\%$ on test). Key contributions include a detailed dataset with $58{,}570$ unique foreign terms across $495$ books, $4{,}347$ unique foreign terms in MASRAD, and $19{,}405$ annotated candidates, along with a reproducible methodology and tools for semi-automatic termbase construction. The approach reduces editor workload, improves translation consistency, and supports cross-lingual processing; the best model can run locally without depending on LLMs. Future work aims to broaden data sources, further improve semantic similarity, and expand support for diverse term types while maintaining ethical data usage.

Abstract

This paper presents MASRAD, a terminology dataset for Arabic terminology management, and a method with supporting tools for its semi-automatic construction. The entries in MASRAD are pairs $(f,a)$, where $f$ is a foreign (non-Arabic) term that appears in specialized, academic, and field-specific books next to its Arabic counterpart $a$. MASRAD-Ex systematically extracts these pairs as a first step in constructing MASRAD. MASRAD helps improve term consistency in academic translations and specialized Arabic documents, and helps automate cross-lingual text processing. MASRAD-Ex leverages translated terms that occur organically in Arabic books and considers several candidate pairs for each term phrase. The candidate Arabic terms occur next to the foreign terms and vary in length. MASRAD-Ex computes lexicographic, phonetic, morphological, and semantic similarity metrics for each candidate pair, and uses heuristic, machine learning, and machine learning with post-processing approaches to decide on the best candidate. This paper presents MASRAD after thorough expert review and makes it available to the interested research community. The best-performing MASRAD-Ex approach achieved 90.5% precision and 92.4% recall.
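
The candidate-selection idea described above can be illustrated with a minimal Python sketch. It is not the MASRAD-Ex implementation: the function names (`candidates`, `translit_similarity`, `best_candidate`), the tiny `TRANSLIT` letter map, and the use of a single transliteration-based score are all assumptions made for illustration. The paper's method combines several similarity metrics with heuristics and machine learning. The sketch only shows how variable-length Arabic candidates next to a foreign term might be enumerated and ranked.

```python
from difflib import SequenceMatcher

# Hypothetical, deliberately crude Arabic-to-Latin letter map (illustration only).
TRANSLIT = {
    "ا": "a", "ب": "b", "ت": "t", "ر": "r", "و": "o", "ن": "n",
    "ي": "i", "ل": "l", "م": "m", "ك": "k", "س": "s", "د": "d",
}

def candidates(arabic_tokens, max_len=3):
    """Enumerate candidates of 1..max_len tokens ending right before the foreign term."""
    longest = min(max_len, len(arabic_tokens))
    return [" ".join(arabic_tokens[-k:]) for k in range(1, longest + 1)]

def translit_similarity(foreign, arabic):
    """Toy lexical score: compare the foreign term with a rough romanization."""
    latin = "".join(TRANSLIT.get(ch, "") for ch in arabic)  # unmapped characters are dropped
    return SequenceMatcher(None, foreign.lower(), latin).ratio()

def best_candidate(foreign, arabic_tokens, score=translit_similarity, max_len=3):
    """Return the highest-scoring candidate under the supplied scoring function."""
    return max(candidates(arabic_tokens, max_len), key=lambda c: score(foreign, c))

# Arabic tokens that precede the foreign term "neutron" in a hypothetical sentence.
print(best_candidate("neutron", ["جسيم", "نيوترون"]))
```

Passing a different `score` function (for example, one that averages several feature scores) changes the ranking criterion without touching candidate enumeration, which keeps the two concerns separate.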

Paper Structure

This paper contains 20 sections, 11 equations, 1 figure, 9 tables.

Figures (1)

  • Figure 1: An overview of the process