Extending Translate-Train for ColBERT-X to African Language CLIR

Eugene Yang, Dawn J. Lawrie, Paul McNamee, James Mayfield

TL;DR

This work adapts Translate-Train for African CLIR using ColBERT-X, addressing MT limitations with document translation, translated MS MARCO, and targeted language-model fine-tuning. The multi-step pipeline includes MLM pretraining for Yoruba, Translate-Train retrieval fine-tuning, and in-domain JH POLO data generation, comparing English-trained MS MARCO, MT-translated MS MARCO, and MT document indexing. Official results show English ColBERT on MT documents often performs best, while Translate-Train improves ColBERT-X and MLM helps Yoruba, though JH POLO is not consistently beneficial. Unofficial runs further suggest ColBERT-X can be competitive with careful training, highlighting the practicality and remaining challenges of low-resource language CLIR, MT quality, and domain mismatch.

Abstract

This paper describes the runs submitted by the HLTCOE team to the CIRAL CLIR tasks for African languages at FIRE 2023. Our submissions use machine translation models to translate the documents and the training passages, and use ColBERT-X as the retrieval model. Additionally, we present a set of unofficial runs that use an alternative training procedure with a similar training setting.

Paper Structure (10 sections, 2 figures, 4 tables)

Figures (2)

  • Figure 1: GPT-4 prompt used to create JH POLO training examples.
  • Figure 2: GPT-4 output used to create JH POLO training examples.