TurkColBERT: A Benchmark of Dense and Late-Interaction Models for Turkish Information Retrieval

Özay Ezerceli; Mahmoud El Hussieni; Selva Taş; Reyhan Bayraktar; Fatma Betül Terzioğlu; Yusuf Çelebi; Yağız Asker

TL;DR

Turkish information retrieval is underexplored for neural models, especially for late-interaction architectures that preserve token-level semantics. The authors introduce TurkColBERT, a comprehensive benchmark that compares dense encoders with ColBERT-style late-interaction models. The models are built with a two-stage adaptation pipeline (Turkish semantic fine-tuning followed by PyLate-based ColBERT adaptation) and are also evaluated with MUVERA indexing for efficiency. Across five Turkish BEIR datasets, late-interaction models, notably ColmmBERT-base-TR, outperform dense baselines while achieving strong parameter efficiency; ultra-compact ColBERT variants retain substantial performance. MUVERA indexing yields production-ready, low-latency retrieval (as low as 0.54 ms) with competitive accuracy, enabling scalable Turkish IR. Limitations include dataset size and translated benchmarks, indicating the need for web-scale evaluations and morphology-aware approaches in future work.

Abstract

Neural information retrieval systems excel in high-resource languages but remain underexplored for morphologically rich, lower-resource languages such as Turkish. Dense bi-encoders currently dominate Turkish IR, yet late-interaction models, which retain token-level representations for fine-grained matching, have not been systematically evaluated. We introduce TurkColBERT, the first comprehensive benchmark comparing dense encoders and late-interaction models for Turkish retrieval. Our two-stage adaptation pipeline fine-tunes English and multilingual encoders on Turkish NLI/STS tasks, then converts them into ColBERT-style retrievers using PyLate trained on MS MARCO-TR. We evaluate 10 models across five Turkish BEIR datasets covering scientific, financial, and argumentative domains. Results show strong parameter efficiency: the 1.0M-parameter colbert-hash-nano-tr is 600× smaller than the 600M turkish-e5-large dense encoder while preserving over 71% of its average mAP. Late-interaction models that are 3 to 5× smaller than dense encoders significantly outperform them; ColmmBERT-base-TR yields up to +13.8% mAP on domain-specific tasks. For production-readiness, we compare indexing algorithms: MUVERA+Rerank is 3.33× faster than PLAID and offers +1.7% relative mAP gain. This enables low-latency retrieval, with ColmmBERT-base-TR achieving 0.54 ms query times under MUVERA. We release all checkpoints, configs, and evaluation scripts. Limitations include reliance on moderately sized datasets (≤50K documents) and translated benchmarks, which may not fully reflect real-world Turkish retrieval conditions; larger-scale MUVERA evaluations remain necessary.
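To make the distinction between dense and late-interaction scoring concrete, the following is a minimal sketch of the MaxSim scoring rule used by ColBERT-style models, next to a single-vector score. It is not the authors' code and does not use PyLate. The function names, the mean-pooling choice for the dense score, and the toy two-dimensional "token embeddings" are assumptions made purely for illustration.

```python
import math


def normalize(vec):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def maxsim_score(query_tokens, doc_tokens):
    """Late-interaction score: for each query token, take its best cosine
    match among the document tokens, then sum over query tokens."""
    q = [normalize(v) for v in query_tokens]
    d = [normalize(v) for v in doc_tokens]
    return sum(max(dot(qv, dv) for dv in d) for qv in q)


def dense_score(query_tokens, doc_tokens):
    """Single-vector score: mean-pool the tokens on each side (an assumed
    pooling choice), then take one cosine similarity."""
    def mean_pool(tokens):
        return [sum(col) / len(tokens) for col in zip(*tokens)]

    return dot(normalize(mean_pool(query_tokens)),
               normalize(mean_pool(doc_tokens)))


# Hypothetical toy embeddings: two query tokens, two candidate documents.
query = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[1.0, 0.0], [0.0, 1.0]]   # matches both query tokens
doc_b = [[1.0, 0.0], [1.0, 0.0]]   # matches only the first query token

score_a = maxsim_score(query, doc_a)
score_b = maxsim_score(query, doc_b)
print(score_a, score_b)
print(dense_score(query, doc_a), dense_score(query, doc_b))
```

In this toy case the late-interaction score counts each query token's best match separately, so the document covering both query tokens scores twice as high as the one covering only the first. The dense score compresses each side into one vector before comparing, which is the token-level information that late interaction retains.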
