Extending Translate-Train for ColBERT-X to African Language CLIR
Eugene Yang, Dawn J. Lawrie, Paul McNamee, James Mayfield
TL;DR
This work adapts Translate-Train to cross-language information retrieval (CLIR) for African languages using ColBERT-X, addressing the limitations of machine translation (MT) with document translation, MT-translated MS MARCO training data, and targeted language-model fine-tuning. The multi-step pipeline includes masked language model (MLM) pretraining for Yoruba, Translate-Train retrieval fine-tuning, and in-domain training data generation with JH POLO. The runs compare training on English MS MARCO, training on MT-translated MS MARCO, and indexing MT-translated documents. Official results show that an English ColBERT model applied to MT-translated documents often performs best, that Translate-Train improves ColBERT-X, and that MLM pretraining helps Yoruba, while JH POLO is not consistently beneficial. Unofficial runs further suggest that ColBERT-X can be competitive with careful training. Together, the results highlight both the practicality of CLIR for low-resource languages and its remaining challenges, notably MT quality and domain mismatch.
Abstract
This paper describes the runs submitted by the HLTCOE team to the CIRAL CLIR tasks for African languages at FIRE 2023. Our submissions use machine translation models to translate the documents and the training passages, and use ColBERT-X as the retrieval model. Additionally, we present a set of unofficial runs that use an alternative training procedure in a similar training setting.
