Advancing the Arabic WordNet: Elevating Content Quality
Abed Alhakim Freihat, Hadi Khalilia, Gábor Bella, Fausto Giunchiglia
TL;DR
This work tackles quality deficiencies in Arabic WordNets by introducing AWN V3, a major revision that adds glosses and usage examples, fixes incorrect lemmas, and expands coverage. It introduces lexical gaps and phrasets to handle untranslatability and language diversity explicitly, and applies a disciplined workflow of two translators plus expert validation to reduce polysemy and improve reliability. Quantitatively, AWN V3 updates 5,554 synsets from AWN V1, adding 2,726 lemmas, 9,322 glosses, and 12,204 examples, while identifying 236 lexical gaps and 701 phrasets and deleting 8,751 incorrect lemmas. The resulting resource contains 9,576 synsets. The approach favors quality over sheer coverage, aiming to produce a robust, translator-friendly Arabic lexical resource with practical benefits for NLP tasks and cross-language applications. It also provides datasets and a methodology for extending the revision to AWN V2 and to the remaining Princeton WordNet (PWN) synsets in future work.
Abstract
High-quality wordnets are crucial for achieving high-quality results in NLP applications that rely on such resources. However, the wordnets of most languages suffer from serious issues of correctness and completeness with respect to the words and word meanings they define, such as incorrect lemmas, missing glosses and example sentences, or an inadequate, Western-centric representation of the morphology and the semantics of the language. Previous efforts have largely focused on increasing lexical coverage while ignoring other qualitative aspects. In this paper, we focus on the Arabic language and introduce a major revision of the Arabic WordNet that addresses multiple dimensions of lexico-semantic resource quality. As a result, we updated more than 58% of the synsets of the existing Arabic WordNet by adding missing information and correcting errors. To address issues of language diversity and untranslatability, we also extended the wordnet structure with two new elements: phrasets and lexical gaps.
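To make the extended structure more concrete, the following is a minimal sketch of how a synset record carrying the new elements might be modeled. It is not the authors' data format: the class name `Synset`, every field name, and the consistency rule are hypothetical. The sketch assumes that a lexical gap marks a concept for which the language has no lexicalized word, and that a phraset holds free word combinations used to express such a concept. The string values are placeholders, not entries from AWN V3.

```python
from dataclasses import dataclass, field


@dataclass
class Synset:
    """Hypothetical synset record; all field names are illustrative assumptions."""
    synset_id: str
    lemmas: list[str] = field(default_factory=list)    # lexicalized words for the concept
    gloss: str = ""                                     # short definition
    examples: list[str] = field(default_factory=list)  # usage example sentences
    phrasets: list[str] = field(default_factory=list)  # free word combinations (assumed meaning)
    lexical_gap: bool = False                           # True if no lexicalized word exists

    def is_consistent(self) -> bool:
        # Assumed rule: a lexical gap carries no lemmas,
        # while a regular synset carries at least one.
        if self.lexical_gap:
            return not self.lemmas
        return bool(self.lemmas)


# Placeholder records (not real AWN V3 data).
regular = Synset(
    synset_id="synset-0001",
    lemmas=["lemma_a", "lemma_b"],
    gloss="placeholder gloss",
    examples=["placeholder example sentence"],
)
gap = Synset(
    synset_id="synset-0002",
    gloss="placeholder gloss for an untranslatable concept",
    phrasets=["placeholder multi-word paraphrase"],
    lexical_gap=True,
)
```

Under these assumptions, a record flagged as a gap that still lists lemmas would fail `is_consistent()`, which is the kind of error a validation pass could flag for expert review.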
