TurkEmbed4Retrieval: Turkish Embedding Model for Retrieval Task

Özay Ezerceli; Gizem Gümüşçekiçci; Tuğba Erkoç; Berke Özenç

TL;DR

This work addresses the gap in Turkish information retrieval by presenting TurkEmbed4Retrieval, a retrieval-specialized variant of TurkEmbed. The authors combine Matryoshka Representation Learning with a staged training pipeline: pretraining on ALL-NLI-TR and STSB-TR, followed by fine-tuning on MS-Marco-TR with a Cached Multiple Negatives Ranking Loss. On SciFact-TR, TurkEmbed4Retrieval outperforms Turkish-colBERT by substantial margins (e.g., recall improvements of up to ~32% and MRR gains of ~28%), establishing a new Turkish IR benchmark. The results point to the value of language-specific embeddings, large-scale Turkish data, and advanced negative-sampling losses for information retrieval; future directions include QA and domain-specific data augmentation.

Abstract

In this work, we introduce TurkEmbed4Retrieval, a retrieval-specialized variant of the TurkEmbed model, which was originally designed for Natural Language Inference (NLI) and Semantic Textual Similarity (STS) tasks. By fine-tuning the base model on the MS MARCO TR dataset using advanced training techniques, including Matryoshka representation learning and a tailored multiple negatives ranking loss, we achieve state-of-the-art (SOTA) performance on Turkish retrieval tasks. Extensive experiments demonstrate that our model outperforms Turkish colBERT by 19.26% on key retrieval metrics for the SciFact TR dataset, thereby establishing a new benchmark for Turkish information retrieval.
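
To make the training objective named above more concrete, the following is a minimal, illustrative sketch of an in-batch multiple negatives ranking loss evaluated at several truncated embedding sizes, in the spirit of Matryoshka representation learning. It is not the authors' implementation: the function names (`mnr_loss`, `matryoshka_mnr_loss`), the similarity scale of 20, and the truncation sizes are assumptions chosen for illustration, and the sketch shows the plain (non-cached) form of the loss rather than the cached variant, which is generally understood to target the same objective while reducing memory use for large batches.

```python
import math


def l2_normalize(v):
    # Scale a vector to unit length; leave an all-zero vector unchanged.
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def cosine(a, b):
    # Cosine similarity between two equal-length vectors.
    a, b = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a, b))


def mnr_loss(queries, passages, dim=None, scale=20.0):
    """In-batch multiple negatives ranking loss (hypothetical helper).

    passages[i] is the positive for queries[i]; every other passage in
    the batch acts as a negative. If `dim` is given, embeddings are cut
    to their first `dim` components (Matryoshka-style truncation).
    `scale` is an assumed similarity temperature, not a value from the paper.
    """
    if dim is not None:
        queries = [q[:dim] for q in queries]
        passages = [p[:dim] for p in passages]
    total = 0.0
    for i, q in enumerate(queries):
        logits = [scale * cosine(q, p) for p in passages]
        # Numerically stable log-sum-exp over all passages in the batch.
        m = max(logits)
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        # Cross-entropy with the matching passage as the target class.
        total += log_z - logits[i]
    return total / len(queries)


def matryoshka_mnr_loss(queries, passages, dims=(4, 2), scale=20.0):
    # Sum the ranking loss over several truncation sizes so that the
    # leading dimensions alone remain useful for retrieval.
    return sum(mnr_loss(queries, passages, d, scale) for d in dims)


if __name__ == "__main__":
    # Toy 4-dimensional embeddings; real models use far larger vectors.
    q = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
    print("aligned pairs: ", matryoshka_mnr_loss(q, q))
    print("shuffled pairs:", matryoshka_mnr_loss(q, q[::-1]))
```

In this toy setup, correctly paired queries and passages give a loss close to zero, while shuffled pairs give a much larger loss, which is the behavior the ranking objective is meant to reward.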
