Enhancing Semantic Similarity Understanding in Arabic NLP with Nested Embedding Learning

Omer Nacar; Anis Koubaa

TL;DR

This work tackles semantic similarity in Arabic NLP by applying Matryoshka Representation Learning (MRL), a nested embedding framework that encodes multi-granularity representations within a single vector, so that a truncated prefix of the vector remains a usable embedding. It starts from multilingual, Arabic-specific, and English-based models, trains them on a translated Arabic NLI triplet corpus, and evaluates them on Arabic STSB at multiple embedding dimensions using several similarity metrics. The authors translate SNLI/MultiNLI into Arabic, release the translated datasets and trained Matryoshka models on Hugging Face, and report that multilingual embeddings (notably Paraphrase-Multilingual-MPNet-Base-V2) generally outperform Arabic-specific variants at higher dimensions, with substantial gains over the base models. The results highlight the value of language-adaptive, nested embeddings for efficient and accurate semantic textual similarity in Arabic, and the released models and datasets support practical use in retrieval and other NLP tasks. Overall, the work contributes both a scalable training paradigm and accessible resources for Arabic NLP research and applications.
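The truncation property described above can be illustrated with a minimal sketch. It is not the authors' code: the vectors, the prefix sizes, and the helper names `truncate` and `cosine` are made up for illustration, and plain Python is used so the example runs without dependencies. The idea is that a nested embedding is shortened by keeping only its leading coordinates and rescaling to unit length before comparison.

```python
import math

def truncate(vec, dim):
    """Keep the first `dim` coordinates and rescale to unit length."""
    head = vec[:dim]
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head] if norm > 0 else list(head)

def cosine(a, b):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Made-up 8-dimensional vectors standing in for two sentence embeddings.
u = [0.9, 0.1, 0.3, -0.2, 0.05, 0.0, -0.1, 0.02]
v = [0.8, 0.2, 0.1, -0.1, -0.3, 0.1, 0.0, 0.04]

# Illustrative prefix sizes (assumption); real models use far larger dimensions.
scores = {dim: cosine(truncate(u, dim), truncate(v, dim)) for dim in (8, 4, 2)}
for dim, score in scores.items():
    print(f"dim={dim}: cosine={score:.3f}")
```

With a model trained under a Matryoshka objective, the scores at smaller prefixes would be expected to stay close to the full-dimension score; the toy vectors here carry no such guarantee and only show the mechanics.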

Abstract

This work presents a novel framework for training Arabic nested embedding models through Matryoshka Embedding Learning, leveraging multilingual, Arabic-specific, and English-based models, to highlight the power of nested embedding models in various Arabic NLP downstream tasks. Our contribution includes the translation of various sentence similarity datasets into Arabic, enabling a comprehensive evaluation framework to compare these models across different dimensions. We trained several nested embedding models on the Arabic Natural Language Inference triplet dataset and assessed their performance using multiple evaluation metrics, including Pearson and Spearman correlations for cosine similarity, Manhattan distance, Euclidean distance, and dot product similarity. The results demonstrate that Arabic Matryoshka embedding models are superior at capturing semantic nuances unique to the Arabic language, outperforming traditional models by up to 20-25% across various similarity metrics. These results underscore the effectiveness of language-specific training and highlight the potential of Matryoshka models in enhancing semantic textual similarity tasks for Arabic NLP.
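The evaluation protocol named in the abstract (Pearson and Spearman correlations between gold similarity scores and four embedding-based scores) can be sketched as follows. This is a dependency-free illustration under stated assumptions, not the authors' evaluation code: the toy vectors and gold scores are invented, the function names are hypothetical, and distances are negated so that a larger value always means "more similar", which is one common convention.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

# Distances are negated so that higher always means more similar (assumption).
SIMILARITIES = {
    "cosine": lambda a, b: dot(a, b) / math.sqrt(dot(a, a) * dot(b, b)),
    "manhattan": lambda a, b: -sum(abs(p - q) for p, q in zip(a, b)),
    "euclidean": lambda a, b: -math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b))),
    "dot": dot,
}

def evaluate(pairs, gold):
    """Correlate each similarity function's scores with the gold scores."""
    report = {}
    for name, sim in SIMILARITIES.items():
        predicted = [sim(a, b) for a, b in pairs]
        report[name] = {
            "pearson": pearson(predicted, gold),
            "spearman": spearman(predicted, gold),
        }
    return report

# Invented embedding pairs and gold scores on a 0-5 scale.
pairs = [
    ([1.0, 0.0, 0.0], [1.0, 0.0, 0.0]),
    ([1.0, 0.0, 0.0], [0.7, 0.7, 0.0]),
    ([1.0, 0.0, 0.0], [0.0, 1.0, 0.0]),
    ([1.0, 0.0, 0.0], [-1.0, 0.0, 0.0]),
]
gold = [5.0, 3.0, 1.0, 0.0]

results = evaluate(pairs, gold)
for name, scores in results.items():
    print(name, {k: round(v, 3) for k, v in scores.items()})
```

To compare dimensions, the same routine would be run once per truncated embedding size, giving one row of correlations per dimension and metric.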

Paper Structure (17 sections, 1 equation, 6 figures, 14 tables)

Introduction
Related Work
Dataset Preparation
Arabic Dataset
Data Preprocessing
Translation Process
Methodology
Model Selection
Matryoshka Embedding Models
Nested Embedding Training Process
Results & Discussion
Comprehensive Performance Analysis For Each Trained Arabic Matryoshka Embedding Model
Comparative Analysis of Different Arabic Trained Matryoshka Models Performance Across Metrics and Dimensions
Comparison of Base Models Vs. Arabic Trained Matryoshka Models
Analysis of Similarity Scores Predicted by Arabic Trained Matryoshka Models
...and 2 more sections

Figures (6)

Figure 1: Matryoshka Representation Learning Process (Kusupati et al.)
Figure 2: Truncation Step in Matryoshka Representation Learning
Figure 3: Comparative Analysis of Model Performance Across Different Metrics and Dimensions
Figure 4: Comparison of Base Models Vs. Trained Matryoshka Models Across Various Metrics
Figure 5: Comparison of Average Predicted Cosine Similarity Scores for Different Similarity Categories
...and 1 more figures
