Rosetta Stone at KSAA-RD Shared Task: A Hop From Language Modeling To Word–Definition Alignment
Ahmed ElBakry, Mohamed Gabr, Muhammad ElNokrashy, Badr AlKhamissi
TL;DR
The paper addresses Arabic reverse dictionary construction under the Tip-of-the-Tongue setting by fine-tuning and ensembling multiple Arabic pretrained transformers to predict word embeddings from definitions. For Arabic definitions, models are trained to map to SGNS and ELECTRA embeddings, with CLS-based representations combined via ensemble averaging. Subtask 2 uses English definitions by either cross-lingual alignment to Arabic embeddings or a translate-test pipeline that feeds Arabic translations into the same Arabic models, yielding competitive results. The key finding is that ensembling CamelBERT-MSA and MARBERTv2 delivers the best performance across subtasks, while future work points to cross-lingual enhancements and self-synthesis data augmentation to improve robustness and generalization.
Abstract
A Reverse Dictionary is a tool that enables users to discover a word from its definition, meaning, or description. Such a tool proves valuable in various scenarios: it aids language learners who have a description of a word but not the word itself, and it benefits writers seeking precise terminology. These scenarios are often instances of what is referred to as the "Tip-of-the-Tongue" (TOT) phenomenon. In this work, we present our winning solution for the Arabic Reverse Dictionary shared task. This task focuses on deriving a vector representation of an Arabic word from its accompanying description. The shared task encompasses two distinct subtasks: the first takes an Arabic definition as input, while the second takes an English definition. For the first subtask, our approach relies on an ensemble of fine-tuned Arabic BERT-based models, each predicting the word embedding for a given definition. The final representation is obtained by averaging the output embeddings of the models in the ensemble. In contrast, the most effective solution for the second subtask translates the English test definitions into Arabic and feeds them to the fine-tuned models originally trained for the first subtask. This straightforward method achieves the highest score across both subtasks.
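The averaging step described above can be illustrated with a minimal sketch. It assumes each fine-tuned model has already produced a predicted embedding of the same dimension for a given definition. The function name `average_embeddings` and the toy three-dimensional vectors are hypothetical and are not taken from the shared-task code or data.

```python
from typing import List, Sequence


def average_embeddings(predictions: Sequence[Sequence[float]]) -> List[float]:
    """Element-wise mean of the per-model predicted embeddings for one definition."""
    if not predictions:
        raise ValueError("need at least one model prediction")
    dim = len(predictions[0])
    if any(len(p) != dim for p in predictions):
        raise ValueError("all predictions must share the same dimension")
    n = len(predictions)
    # Average each coordinate across the models in the ensemble.
    return [sum(p[i] for p in predictions) / n for i in range(dim)]


# Hypothetical outputs of two fine-tuned models for the same definition
# (toy values, not real model predictions).
pred_model_a = [0.2, 0.4, -0.6]
pred_model_b = [0.4, 0.0, -0.2]

ensemble = average_embeddings([pred_model_a, pred_model_b])
print(ensemble)
```

Under the translate-test setup for the second subtask, the same averaging would apply unchanged, since the translated definitions are passed through the same models.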
