TurkColBERT: A Benchmark of Dense and Late-Interaction Models for Turkish Information Retrieval
Özay Ezerceli, Mahmoud El Hussieni, Selva Taş, Reyhan Bayraktar, Fatma Betül Terzioğlu, Yusuf Çelebi, Yağız Asker
TL;DR
Turkish information retrieval remains underexplored for neural models, especially for late-interaction architectures that preserve token-level representations. The authors introduce TurkColBERT, a benchmark that compares dense encoders with ColBERT-style late-interaction models. Models are built with a two-stage adaptation pipeline (Turkish semantic fine-tuning followed by PyLate-based ColBERT adaptation) and paired with MUVERA indexing for efficient retrieval. Across five Turkish BEIR datasets, late-interaction models, notably ColmmBERT-base-TR, outperform dense baselines with strong parameter efficiency; ultra-compact ColBERT variants retain substantial performance. MUVERA indexing yields production-ready, low-latency retrieval (query times as low as 0.54 ms) with competitive accuracy, enabling scalable Turkish IR. Limitations include moderate dataset sizes and reliance on translated benchmarks, which point to web-scale evaluation and morphology-aware approaches as future work.
Abstract
Neural information retrieval systems excel in high-resource languages but remain underexplored for morphologically rich, lower-resource languages such as Turkish. Dense bi-encoders currently dominate Turkish IR, yet late-interaction models, which retain token-level representations for fine-grained matching, have not been systematically evaluated. We introduce TurkColBERT, the first comprehensive benchmark comparing dense encoders and late-interaction models for Turkish retrieval. Our two-stage adaptation pipeline fine-tunes English and multilingual encoders on Turkish NLI/STS tasks, then converts them into ColBERT-style retrievers using PyLate trained on MS MARCO-TR. We evaluate 10 models across five Turkish BEIR datasets covering scientific, financial, and argumentative domains. Results show strong parameter efficiency: the 1.0M-parameter colbert-hash-nano-tr is 600× smaller than the 600M turkish-e5-large dense encoder while preserving over 71% of its average mAP. Late-interaction models that are 3-5× smaller than dense encoders significantly outperform them; ColmmBERT-base-TR yields up to +13.8% mAP on domain-specific tasks. For production-readiness, we compare indexing algorithms: MUVERA+Rerank is 3.33× faster than PLAID and offers +1.7% relative mAP gain. This enables low-latency retrieval, with ColmmBERT-base-TR achieving 0.54 ms query times under MUVERA. We release all checkpoints, configs, and evaluation scripts. Limitations include reliance on moderately sized datasets (≤50K documents) and translated benchmarks, which may not fully reflect real-world Turkish retrieval conditions; larger-scale MUVERA evaluations remain necessary.
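To make the contrast between dense bi-encoders and late-interaction models concrete, the following is a minimal sketch of the two scoring schemes, assuming the standard ColBERT-style MaxSim operator (each query token is matched to its most similar document token, and the maxima are summed). It is not the authors' code and does not use the PyLate API; the function names and the toy 2-dimensional embeddings are hypothetical and chosen only for illustration.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def normalize(v):
    # Scale a vector to unit length so dot products become cosine similarities.
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v] if norm > 0 else list(v)

def mean_pool(tokens):
    # Collapse token embeddings into one vector, as a dense bi-encoder might.
    n = len(tokens)
    return [sum(col) / n for col in zip(*tokens)]

def dense_score(query_vec, doc_vec):
    # Dense scoring: a single cosine similarity between pooled vectors.
    return dot(normalize(query_vec), normalize(doc_vec))

def maxsim_score(query_tokens, doc_tokens):
    # Late-interaction scoring (MaxSim): for every query token, keep the
    # similarity of its best-matching document token, then sum over the query.
    q = [normalize(t) for t in query_tokens]
    d = [normalize(t) for t in doc_tokens]
    return sum(max(dot(qt, dt) for dt in d) for qt in q)

# Hypothetical toy embeddings (made-up values, not taken from any model).
query = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]  # contains exact token matches
doc_b = [[1.0, 1.0], [1.0, 1.0]]               # only matches "on average"

print("MaxSim:", maxsim_score(query, doc_a), maxsim_score(query, doc_b))
print("Dense :", dense_score(mean_pool(query), mean_pool(doc_a)),
      dense_score(mean_pool(query), mean_pool(doc_b)))
```

In this toy setup, MaxSim prefers doc_a, whose tokens match the query tokens individually, while the pooled dense score prefers doc_b. This illustrates the kind of token-level signal that pooling into a single vector can discard; it says nothing about the magnitude of the gains reported in the benchmark.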
