TurkEmbed4Retrieval: Turkish Embedding Model for Retrieval Task
Özay Ezerceli, Gizem Gümüşçekiçci, Tuğba Erkoç, Berke Özenç
TL;DR
This work addresses the gap in Turkish information retrieval by presenting TurkEmbed4Retrieval, a retrieval-specialized variant of TurkEmbed. The authors adopt Matryoshka Representation Learning and a staged training pipeline: pretraining on ALL-NLI-TR and STSB-TR, followed by fine-tuning on MS-Marco-TR with a Cached Multiple Negatives Ranking Loss. Empirical results on SciFact-TR show TurkEmbed4Retrieval surpassing Turkish-colBERT by substantial margins (e.g., recall improvements of up to ~32% and MRR gains of ~28%), establishing a new Turkish IR benchmark. The approach demonstrates the value of language-specific embeddings, large-scale Turkish data, and advanced negative-sampling losses for information retrieval, with future directions including QA and domain-specific data augmentation.
Abstract
In this work, we introduce TurkEmbed4Retrieval, a retrieval-specialized variant of the TurkEmbed model, which was originally designed for Natural Language Inference (NLI) and Semantic Textual Similarity (STS) tasks. By fine-tuning the base model on the MS MARCO TR dataset using advanced training techniques, including Matryoshka representation learning and a tailored multiple negatives ranking loss, we achieve state-of-the-art (SOTA) performance on Turkish retrieval tasks. Extensive experiments demonstrate that our model outperforms Turkish colBERT by 19,26% on key retrieval metrics for the Scifact TR dataset, thereby establishing a new benchmark for Turkish information retrieval.
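
To make the training objective named above more concrete, the following is a minimal sketch of how a multiple negatives ranking loss can be combined with Matryoshka representation learning. It is not the authors' implementation. The function names, the toy vectors, the nested dimensions (2, 4), the cosine similarity, and the scale of 20.0 are all assumptions chosen for illustration. The gradient-caching part of the cached variant is omitted, since it is a memory-saving technique that should leave the loss value unchanged.

```python
import math

def cosine(u, v):
    # Cosine similarity between two equal-length vectors.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def mnr_loss(queries, passages, scale=20.0):
    # In-batch multiple negatives ranking loss (hypothetical helper).
    # passages[i] is the positive for queries[i]; every other passage
    # in the batch is treated as a negative. The loss is the mean
    # cross-entropy of picking the positive among all passages.
    total = 0.0
    for i, q in enumerate(queries):
        logits = [scale * cosine(q, p) for p in passages]
        m = max(logits)  # subtract the max for a stable log-sum-exp
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_z - logits[i]
    return total / len(queries)

def matryoshka_mnr_loss(queries, passages, dims=(2, 4), scale=20.0):
    # Matryoshka-style objective (assumed, equally weighted): apply the
    # same ranking loss to the first d components of every embedding,
    # for each nested dimension d, and average the results.
    losses = []
    for d in dims:
        q_d = [q[:d] for q in queries]
        p_d = [p[:d] for p in passages]
        losses.append(mnr_loss(q_d, p_d, scale))
    return sum(losses) / len(losses)

# Toy 4-dimensional embeddings: passages[i] is the match for queries[i].
queries = [[1.0, 0.1, 0.0, 0.2], [0.1, 1.0, 0.2, 0.0]]
passages = [[0.9, 0.2, 0.1, 0.1], [0.2, 0.8, 0.1, 0.3]]

aligned = matryoshka_mnr_loss(queries, passages)
swapped = matryoshka_mnr_loss(queries, passages[::-1])
print(aligned < swapped)
```

In this sketch, correctly paired queries and passages yield a lower loss than the swapped pairing at every nested dimension, which is the property that lets truncated embeddings remain usable for retrieval.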
