Enhancing Semantic Similarity Understanding in Arabic NLP with Nested Embedding Learning
Omer Nacar, Anis Koubaa
TL;DR
This work tackles semantic similarity in Arabic NLP by applying Matryoshka Representation Learning (MRL), a nested embedding framework that encodes multi-granularity representations within a single vector so that the vector remains effective when truncated to fewer dimensions. The authors start from multilingual, Arabic-specific, and English-based models, train them on an Arabic NLI triplet corpus translated from SNLI/MultiNLI, and evaluate them on Arabic STSB at multiple embedding dimensions. They release the translated datasets and the trained Matryoshka models on Hugging Face, and report that multilingual embeddings (notably Paraphrase-Multilingual-MPNet-Base-V2) generally outperform Arabic-specific variants at higher dimensions, with substantial gains over the base models. The results highlight the value of language-adaptive, nested embeddings for efficient and accurate semantic textual similarity in Arabic, and the released models and datasets support practical use in retrieval and other NLP tasks.
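The truncation property can be illustrated with a minimal sketch in plain Python. The vectors below are made-up toy values, not outputs of any released model, and the helper names `truncate_and_normalize` and `cosine` are hypothetical. Re-normalizing after truncation is a common convention assumed here, not a step confirmed by the source.

```python
import math

def truncate_and_normalize(vec, dim):
    """Keep the first `dim` components and rescale them to unit length."""
    head = list(vec[:dim])
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        return head  # nothing to rescale for an all-zero prefix
    return [x / norm for x in head]

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

# Toy 8-dimensional "sentence embeddings" (assumed values for illustration).
emb_a = [0.9, 0.1, 0.3, -0.2, 0.05, 0.02, -0.01, 0.03]
emb_b = [0.8, 0.2, 0.25, -0.1, -0.04, 0.01, 0.02, -0.02]

# Compare the same pair at progressively smaller nested dimensions.
for dim in (8, 4, 2):
    a = truncate_and_normalize(emb_a, dim)
    b = truncate_and_normalize(emb_b, dim)
    print(dim, round(cosine(a, b), 4))
```

With a Matryoshka-trained model, the expectation is that the similarity ranking across sentence pairs stays largely stable as `dim` shrinks, which is what makes the shorter vectors usable for cheaper storage and retrieval.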
Abstract
This work presents a novel framework for training Arabic nested embedding models through Matryoshka Embedding Learning, leveraging multilingual, Arabic-specific, and English-based models to highlight the power of nested embedding models in various Arabic NLP downstream tasks. Our contributions include the translation of several sentence similarity datasets into Arabic, enabling a comprehensive evaluation framework for comparing these models across different embedding dimensions. We trained several nested embedding models on the Arabic Natural Language Inference triplet dataset and assessed their performance using multiple evaluation metrics, namely Pearson and Spearman correlations computed over cosine similarity, Manhattan distance, Euclidean distance, and dot product similarity. The results demonstrate that Arabic Matryoshka embedding models are better at capturing semantic nuances unique to the Arabic language, outperforming traditional models by up to 20-25% across various similarity metrics. These results underscore the effectiveness of language-specific training and highlight the potential of Matryoshka models in enhancing semantic textual similarity tasks for Arabic NLP.
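The evaluation metrics named above can be sketched in plain Python as follows. This is an illustrative sketch, not the authors' evaluation code: the embedding pairs and gold scores are made-up toy values, all function names are hypothetical, and negating the two distances so that larger always means "more similar" is a convention assumed here.

```python
import math

def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def ranks(values):
    """1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

def cosine(a, b):
    dot = sum(p * q for p, q in zip(a, b))
    return dot / (math.sqrt(sum(p * p for p in a)) * math.sqrt(sum(q * q for q in b)))

def neg_manhattan(a, b):
    return -sum(abs(p - q) for p, q in zip(a, b))

def neg_euclidean(a, b):
    return -math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))

def dot_product(a, b):
    return sum(p * q for p, q in zip(a, b))

# Toy sentence-pair embeddings with gold similarity scores (assumed values).
pairs = [
    ([1.0, 0.0], [1.0, 0.1]),
    ([1.0, 0.0], [0.7, 0.7]),
    ([1.0, 0.0], [0.0, 1.0]),
    ([1.0, 0.0], [-1.0, 0.2]),
]
gold = [4.8, 3.0, 1.0, 0.2]

scorers = {
    "cosine": cosine,
    "manhattan": neg_manhattan,
    "euclidean": neg_euclidean,
    "dot": dot_product,
}
for name, score in scorers.items():
    predicted = [score(a, b) for a, b in pairs]
    print(name, round(pearson(predicted, gold), 4), round(spearman(predicted, gold), 4))
```

Spearman depends only on how the pairs are ranked, so it tolerates scores that are monotonically but not linearly related to the gold labels, while Pearson also rewards a linear fit.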
