Enhancing Semantic Similarity Understanding in Arabic NLP with Nested Embedding Learning

Omer Nacar, Anis Koubaa

TL;DR

This work tackles semantic similarity in Arabic NLP by applying Matryoshka Representation Learning (MRL), a nested embedding framework that encodes multi-granularity representations within a single vector so that the vector remains effective when truncated. It starts from multilingual, Arabic-specific, and English-based base models, trains them on a translated Arabic NLI triplet corpus, and evaluates them on Arabic STSB using Pearson and Spearman correlations over several similarity measures at multiple embedding dimensions. The authors translate SNLI/MultiNLI into Arabic, release the translated datasets and trained Matryoshka models on Hugging Face, and show that multilingual embeddings (notably Paraphrase-Multilingual-MPNet-Base-V2) generally outperform Arabic-specific variants at higher dimensions, with substantial gains over the base models. The results highlight the value of language-adaptive, nested embeddings for efficient and accurate semantic textual similarity in Arabic, and the released resources support practical deployment for retrieval and other NLP tasks. Overall, the work contributes both a scalable training paradigm and accessible resources for Arabic NLP research and applications.
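The truncation property described above can be illustrated with a minimal sketch. It is not the authors' code: the 8-dimensional toy vectors, the chosen sizes (8, 4, 2), and the helper names `truncate_and_normalize` and `dot` are assumptions made for illustration. The sketch keeps the leading coordinates of an embedding, L2-normalizes the result, and compares two sentences at each size.

```python
import math
from typing import List, Sequence


def truncate_and_normalize(vector: Sequence[float], dim: int) -> List[float]:
    """Keep the first `dim` coordinates and rescale them to unit L2 norm."""
    if not 1 <= dim <= len(vector):
        raise ValueError("dim must be between 1 and the full embedding size")
    head = list(vector[:dim])
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        return head  # an all-zero prefix cannot be normalized
    return [x / norm for x in head]


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Dot product; equals cosine similarity when both inputs are unit vectors."""
    return sum(x * y for x, y in zip(a, b))


# Hypothetical toy embeddings standing in for two encoded sentences.
sentence_a = [0.9, 0.1, 0.3, -0.2, 0.05, 0.4, -0.1, 0.2]
sentence_b = [0.8, 0.2, 0.25, -0.1, -0.3, 0.1, 0.2, -0.05]

for dim in (8, 4, 2):  # illustrative nested sizes, largest first
    a = truncate_and_normalize(sentence_a, dim)
    b = truncate_and_normalize(sentence_b, dim)
    print(f"dim={dim}: cosine={dot(a, b):.4f}")
```

With a model trained under a Matryoshka objective, the leading coordinates are expected to carry most of the semantic signal, so similarity scores at smaller sizes should stay close to the full-size score. Random or untrained vectors give no such guarantee.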

Abstract

This work presents a novel framework for training Arabic nested embedding models through Matryoshka Embedding Learning, leveraging multilingual, Arabic-specific, and English-based models to demonstrate the value of nested embedding models in Arabic NLP downstream tasks. Our contribution includes the translation of several sentence similarity datasets into Arabic, which enables a comprehensive evaluation framework for comparing these models across different embedding dimensions. We trained several nested embedding models on the Arabic Natural Language Inference triplet dataset and assessed their performance using Pearson and Spearman correlations computed over cosine similarity, Manhattan distance, Euclidean distance, and dot product similarity. The results show that Arabic Matryoshka embedding models are better at capturing semantic nuances unique to the Arabic language, outperforming traditional models by up to 20-25% across the similarity metrics. These results underscore the effectiveness of language-specific training and highlight the potential of Matryoshka models for semantic textual similarity tasks in Arabic NLP.
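The evaluation protocol named in the abstract (Pearson and Spearman correlations over four similarity measures) can be sketched as follows. This is an illustrative sketch using only the standard library, not the authors' evaluation code. The function names, the toy sentence-pair vectors, and the gold scores are assumptions. Distances are negated so that a higher score always means "more similar", which is a common convention rather than something the abstract specifies.

```python
import math
from typing import Dict, List, Sequence, Tuple

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def cosine(a: Vector, b: Vector) -> float:
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


def neg_manhattan(a: Vector, b: Vector) -> float:
    return -sum(abs(x - y) for x, y in zip(a, b))


def neg_euclidean(a: Vector, b: Vector) -> float:
    return -math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation; assumes neither input is constant."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            result[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return result


def spearman(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Spearman correlation computed as Pearson correlation of the ranks."""
    return pearson(ranks(xs), ranks(ys))


def evaluate(pairs: Sequence[Tuple[Vector, Vector]],
             gold: Sequence[float]) -> Dict[str, Tuple[float, float]]:
    """Return {measure: (pearson, spearman)} against gold similarity scores."""
    scorers = {"cosine": cosine, "manhattan": neg_manhattan,
               "euclidean": neg_euclidean, "dot": dot}
    report = {}
    for name, scorer in scorers.items():
        predicted = [scorer(a, b) for a, b in pairs]
        report[name] = (pearson(predicted, gold), spearman(predicted, gold))
    return report


# Hypothetical embeddings for three sentence pairs and their gold scores.
toy_pairs = [([1.0, 0.0], [1.0, 0.0]),
             ([1.0, 0.0], [0.6, 0.8]),
             ([1.0, 0.0], [0.0, 1.0])]
toy_gold = [5.0, 3.0, 0.5]

for measure, (p, s) in evaluate(toy_pairs, toy_gold).items():
    print(f"{measure}: pearson={p:.3f} spearman={s:.3f}")
```

To compare models across embedding sizes, the same report would be computed once per truncated dimension, giving one row of correlations per size.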

Paper Structure

This paper contains 17 sections, 1 equation, 6 figures, and 14 tables.

Figures (6)

  • Figure 1: Matryoshka Representation Learning Process (Kusupati et al.)
  • Figure 2: Truncation Step in Matryoshka Representation Learning
  • Figure 3: Comparative Analysis of Model Performance Across Different Metrics and Dimensions
  • Figure 4: Comparison of Base Models vs. Trained Matryoshka Models Across Various Metrics
  • Figure 5: Comparison of Average Predicted Cosine Similarity Scores for Different Similarity Categories
  • ...and 1 more figure