JaColBERTv2.5: Optimising Multi-Vector Retrievers to Create State-of-the-Art Japanese Retrievers with Constrained Resources
Benjamin Clavié
TL;DR
JaColBERTv2.5 tackles the data-limited Japanese retrieval setting by re-engineering the training and inference pipeline for multi-vector ColBERT-style models. The authors introduce dynamic query length, remove in-batch negatives, adopt schedule-free learning, normalize teacher and student scores, and use KL-divergence distillation from a strong teacher (BGE-M3), complemented by post-training and checkpoint averaging. The resulting 110M-parameter model achieves a 0.754 average score across five benchmarks, above the previous best of 0.720 from prior Japanese and multilingual baselines, while using far fewer resources. The work also shows that label-free distillation and checkpoint averaging can improve generalization and mitigate catastrophic forgetting. Models, data, and intermediate checkpoints are publicly released to accelerate future research.
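To make the combination of score normalization and KL-divergence distillation concrete, the following is a minimal sketch in plain Python. It assumes min-max normalization applied to both teacher and student scores over one query's candidate documents; the summary above does not specify the exact scheme, and the function names are illustrative rather than taken from the released code.

```python
import math


def min_max_normalize(scores):
    """Rescale scores to [0, 1] (assumed normalization scheme)."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        # All scores equal: no ranking information, map everything to 0.
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]


def log_softmax(scores):
    """Numerically stable log-softmax over a list of scores."""
    m = max(scores)
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return [s - log_z for s in scores]


def kl_distillation_loss(teacher_scores, student_scores):
    """KL(teacher || student) over one query's candidate documents."""
    t_log = log_softmax(min_max_normalize(teacher_scores))
    s_log = log_softmax(min_max_normalize(student_scores))
    return sum(math.exp(t) * (t - s) for t, s in zip(t_log, s_log))
```

Under this assumption, the loss depends only on the relative spacing of scores within a query, so a teacher and a student that score on very different scales can still be compared directly.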
Abstract
Neural Information Retrieval has advanced rapidly in high-resource languages, but progress in lower-resource ones such as Japanese has been hindered by data scarcity, among other challenges. Consequently, multilingual models have dominated Japanese retrieval, despite their computational inefficiencies and inability to capture linguistic nuances. While recent multi-vector monolingual models like JaColBERT have narrowed this gap, they still lag behind multilingual methods in large-scale evaluations. This work addresses the suboptimal training methods of multi-vector retrievers in lower-resource settings, focusing on Japanese. We systematically evaluate and improve key aspects of the inference and training settings of JaColBERT, and more broadly, multi-vector models. We further enhance performance through a novel checkpoint merging step, showcasing it to be an effective way of combining the benefits of fine-tuning with the generalization capabilities of the original checkpoint. Building on our analysis, we introduce a novel training recipe, resulting in the JaColBERTv2.5 model. JaColBERTv2.5, with only 110 million parameters and trained in under 15 hours on 4 A100 GPUs, significantly outperforms all existing methods across all common benchmarks, reaching an average score of 0.754, significantly above the previous best of 0.720. To support future research, we make our final models, intermediate checkpoints and all data used publicly available.
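The checkpoint merging step can be illustrated with a small sketch. It assumes uniform parameter averaging over checkpoints that share the same parameter names and shapes; the abstract does not state the weighting used, and representing a checkpoint as a dictionary of flat float lists is a simplification for illustration.

```python
def average_checkpoints(checkpoints):
    """Uniformly average parameters across checkpoints.

    Each checkpoint is a dict mapping a parameter name to a flat list
    of floats (a stand-in for a real tensor state dict).
    """
    if not checkpoints:
        raise ValueError("need at least one checkpoint")
    keys = checkpoints[0].keys()
    for ckpt in checkpoints:
        if ckpt.keys() != keys:
            raise ValueError("checkpoints must share parameter names")
    n = len(checkpoints)
    return {
        name: [sum(vals) / n for vals in zip(*(c[name] for c in checkpoints))]
        for name in keys
    }


# Example: merge a pre-fine-tuning checkpoint with a fine-tuned one.
base = {"w": [1.0, 3.0]}
tuned = {"w": [3.0, 5.0]}
merged = average_checkpoints([base, tuned])
```

Averaging the original and fine-tuned weights in this way is one simple route to keeping some of the original checkpoint's generalization while retaining the gains from fine-tuning.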
