TurkColBERT: A Benchmark of Dense and Late-Interaction Models for Turkish Information Retrieval

Özay Ezerceli, Mahmoud El Hussieni, Selva Taş, Reyhan Bayraktar, Fatma Betül Terzioğlu, Yusuf Çelebi, Yağız Asker

TL;DR

Neural retrieval for Turkish remains underexplored, especially for late-interaction architectures that preserve token-level semantics. The authors introduce TurkColBERT, a comprehensive benchmark comparing dense encoders with ColBERT-style late-interaction models. The late-interaction models come from a two-stage adaptation pipeline (Turkish semantic fine-tuning followed by PyLate-based ColBERT adaptation), and MUVERA indexing is used for efficient retrieval. Across five Turkish BEIR datasets, late-interaction models, notably ColmmBERT-base-TR, outperform dense baselines while achieving strong parameter efficiency; ultra-compact ColBERT variants retain substantial performance. MUVERA indexing yields low-latency retrieval (query times as low as 0.54 ms) with competitive accuracy, which the authors present as production-ready and scalable for Turkish IR. Limitations include moderately sized datasets and translated benchmarks, indicating the need for web-scale evaluations and morphology-aware approaches in future work.

Abstract

Neural information retrieval systems excel in high-resource languages but remain underexplored for morphologically rich, lower-resource languages such as Turkish. Dense bi-encoders currently dominate Turkish IR, yet late-interaction models, which retain token-level representations for fine-grained matching, have not been systematically evaluated. We introduce TurkColBERT, the first comprehensive benchmark comparing dense encoders and late-interaction models for Turkish retrieval. Our two-stage adaptation pipeline fine-tunes English and multilingual encoders on Turkish NLI/STS tasks, then converts them into ColBERT-style retrievers using PyLate trained on MS MARCO-TR. We evaluate 10 models across five Turkish BEIR datasets covering scientific, financial, and argumentative domains. Results show strong parameter efficiency: the 1.0M-parameter colbert-hash-nano-tr is 600× smaller than the 600M turkish-e5-large dense encoder while preserving over 71% of its average mAP. Late-interaction models that are 3–5× smaller than dense encoders significantly outperform them; ColmmBERT-base-TR yields up to +13.8% mAP on domain-specific tasks. For production-readiness, we compare indexing algorithms: MUVERA+Rerank is 3.33× faster than PLAID and offers +1.7% relative mAP gain. This enables low-latency retrieval, with ColmmBERT-base-TR achieving 0.54 ms query times under MUVERA. We release all checkpoints, configs, and evaluation scripts. Limitations include reliance on moderately sized datasets (≤50K documents) and translated benchmarks, which may not fully reflect real-world Turkish retrieval conditions; larger-scale MUVERA evaluations remain necessary.
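
To make the contrast between the two model families concrete, here is a minimal sketch of how a dense bi-encoder and a ColBERT-style late-interaction model score a query-document pair. It is not the authors' code and does not use the PyLate API. The function names (`dense_score`, `maxsim_score`) and the toy 2-dimensional vectors are illustrative assumptions, and the embeddings are assumed to be unit length so that dot products behave like cosine similarities.

```python
from typing import List

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def dense_score(query_vec: Vector, doc_vec: Vector) -> float:
    """Dense bi-encoder: one vector per text, one similarity per pair."""
    return dot(query_vec, doc_vec)


def maxsim_score(query_tokens: List[Vector], doc_tokens: List[Vector]) -> float:
    """Late interaction (MaxSim): for each query token, keep the similarity of
    its best-matching document token, then sum those maxima over the query."""
    return sum(max(dot(q, d) for d in doc_tokens) for q in query_tokens)


# Toy, hand-picked token embeddings (hypothetical values, all unit length).
query = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]]  # has an exact match for each query token
doc_b = [[0.6, 0.8], [0.6, 0.8]]              # only partial matches

score_a = maxsim_score(query, doc_a)  # best matches are 1.0 and 1.0
score_b = maxsim_score(query, doc_b)  # best matches are 0.6 and 0.8
```

In this toy setup `doc_a` scores higher than `doc_b` because each query token finds an exact counterpart among its tokens. This is the fine-grained, token-level matching that the abstract attributes to late-interaction models.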

Paper Structure

This paper contains 13 sections, 1 equation, 1 figure, 3 tables.

Figures (1)

  • Figure 1: Quality–speed trade-off across MUVERA encoding dimensions (128D to 2048D) on SciFact-TR. Higher dimensions lead to faster retrieval but slightly lower NDCG@100. MUVERA+Rerank (128D) recovers near-PLAID quality with 4–5× speedup.
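
The retrieve-then-rerank pattern behind a setup like MUVERA+Rerank can be sketched as below. This is a hedged illustration only. MUVERA compresses each multi-vector representation into a single fixed-dimensional encoding, but its actual construction is not reproduced here; plain mean pooling (`mean_pool`) is swapped in as a simple stand-in. All names, the toy corpus, and the cutoff `k` are assumptions.

```python
from typing import Dict, List, Tuple

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def mean_pool(tokens: List[Vector]) -> Vector:
    """Stand-in single-vector encoding (NOT MUVERA's construction):
    the component-wise mean of the token vectors."""
    dim = len(tokens[0])
    return [sum(t[i] for t in tokens) / len(tokens) for i in range(dim)]


def maxsim(query_tokens: List[Vector], doc_tokens: List[Vector]) -> float:
    """Exact late-interaction score: sum over query tokens of the best match."""
    return sum(max(dot(q, d) for d in doc_tokens) for q in query_tokens)


def retrieve_then_rerank(
    query_tokens: List[Vector],
    corpus: Dict[str, List[Vector]],
    k: int,
) -> List[Tuple[str, float]]:
    # Stage 1: cheap single-vector scoring over the whole corpus.
    q_vec = mean_pool(query_tokens)
    coarse = sorted(
        corpus,
        key=lambda doc_id: dot(q_vec, mean_pool(corpus[doc_id])),
        reverse=True,
    )
    candidates = coarse[:k]
    # Stage 2: exact MaxSim, computed only for the k candidates.
    reranked = [(doc_id, maxsim(query_tokens, corpus[doc_id])) for doc_id in candidates]
    return sorted(reranked, key=lambda pair: pair[1], reverse=True)


# Toy corpus with hypothetical token embeddings.
corpus = {
    "d1": [[1.0, 0.0], [0.0, 1.0]],
    "d2": [[0.6, 0.8]],
    "d3": [[-1.0, 0.0]],
}
query = [[1.0, 0.0], [0.0, 1.0]]
results = retrieve_then_rerank(query, corpus, k=2)
```

In this toy corpus the coarse stage ranks `d2` ahead of `d1`, and the exact MaxSim rerank restores `d1` to the top. The expensive token-level scoring runs on only `k` documents, which is the general reason a rerank step can recover quality at a fraction of the full-search cost.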