TurkEmbed: Turkish Embedding Model on NLI & STS Tasks
Özay Ezerceli, Gizem Gümüşçekiçci, Tuğba Erkoç, Berke Özenç
TL;DR
TurkEmbed tackles the challenge of Turkish semantic understanding by reducing reliance on machine-translated data and addressing Turkish morphology through matryoshka representation learning. The authors employ a two-stage training pipeline, first fine-tuning on All-NLI-TR and then on STSB-TR, using a combination of Multiple Negatives Ranking Loss, CoSENT Loss, and Matryoshka Loss with diverse cross-lingual base models. The approach achieves state-of-the-art performance on Turkish NLI (All-NLI-TR) and STS (STSB-TR) benchmarks, demonstrates strong cross-lingual generalization on STS22-Crosslingual-STS Turkish data, and provides competitive inference efficiency relative to larger multilingual baselines. Overall, TurkEmbed offers a robust, resource-efficient path to higher-quality Turkish embeddings with significant implications for downstream Turkish NLP applications.
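As a rough illustration of how a Matryoshka objective can wrap Multiple Negatives Ranking Loss, the sketch below sums an in-batch-negatives cross-entropy over nested prefixes of each embedding. It is a toy version in plain Python, not the authors' training code. The scale of 20, the prefix sizes in `dims`, and the four-dimensional vectors are assumptions chosen for illustration. The CoSENT term used in the STS stage is not shown.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def mnr_loss(anchors, positives, scale=20.0):
    """In-batch-negatives cross-entropy: positives[i] is the target for
    anchors[i]; every other positive in the batch acts as a negative.
    The scale value is an assumed temperature, not taken from the paper."""
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [scale * cosine(anchor, p) for p in positives]
        top = max(logits)  # subtract the max for a stable log-sum-exp
        log_z = top + math.log(sum(math.exp(x - top) for x in logits))
        total += log_z - logits[i]
    return total / len(anchors)

def matryoshka_mnr_loss(anchors, positives, dims=(2, 4)):
    """Sum the ranking loss over nested prefixes of each embedding, so the
    leading dimensions are trained to be useful on their own.
    The prefix sizes in dims are hypothetical."""
    return sum(
        mnr_loss([a[:d] for a in anchors], [p[:d] for p in positives])
        for d in dims
    )

# Toy batch: aligned[i] is close to anchors[i]; swapped reverses the pairing.
anchors = [[1.0, 0.0, 0.2, 0.1], [0.0, 1.0, 0.1, 0.3]]
aligned = [[0.9, 0.1, 0.2, 0.0], [0.1, 0.9, 0.0, 0.3]]
swapped = aligned[::-1]

print("aligned pairs:", matryoshka_mnr_loss(anchors, aligned))
print("swapped pairs:", matryoshka_mnr_loss(anchors, swapped))
```

The aligned batch yields a much lower loss than the swapped one, at both the 2-dimensional prefix and the full vector. That is the property that lets truncated embeddings remain usable at inference time.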
Abstract
This paper introduces TurkEmbed, a novel Turkish language embedding model designed to outperform existing models, particularly in Natural Language Inference (NLI) and Semantic Textual Similarity (STS) tasks. Current Turkish embedding models often rely on machine-translated datasets, potentially limiting their accuracy and semantic understanding. TurkEmbed utilizes a combination of diverse datasets and advanced training techniques, including matryoshka representation learning, to achieve more robust and accurate embeddings. This approach enables the model to adapt to various resource-constrained environments, offering faster encoding capabilities. Our evaluation on the Turkish STS-b-TR dataset, using Pearson and Spearman correlation metrics, demonstrates significant improvements in semantic similarity tasks. Furthermore, TurkEmbed surpasses the current state-of-the-art model, Emrecan, on All-NLI-TR and STS-b-TR benchmarks, achieving a 1-4% improvement. TurkEmbed promises to enhance the Turkish NLP ecosystem by providing a more nuanced understanding of language and facilitating advancements in downstream applications.
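The Pearson and Spearman metrics mentioned above compare a model's predicted similarity for each sentence pair against a human-annotated gold score. A minimal sketch in plain Python follows. The gold and predicted values are made-up toy numbers, not results from the paper, and the helper names are hypothetical.

```python
import math

def pearson(x, y):
    """Pearson correlation: linear agreement between two score lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)

def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks

def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

# Toy data (assumed): gold scores on a 0-5 scale, predictions as cosine similarities.
gold = [0.0, 1.5, 3.0, 4.2, 5.0]
predicted = [0.10, 0.35, 0.30, 0.80, 0.95]

print("Pearson: ", round(pearson(gold, predicted), 3))
print("Spearman:", round(spearman(gold, predicted), 3))  # Spearman: 0.9
```

Spearman depends only on the ordering of the pairs, so it is insensitive to the scale mismatch between 0-5 gold scores and cosine similarities. Pearson additionally rewards a linear relationship between the two.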
