HLTCOE at TREC 2023 NeuCLIR Track

Eugene Yang, Dawn Lawrie, James Mayfield

TL;DR

The paper assesses multiple fine-tuning strategies for multilingual retrieval in NeuCLIR 2023 by combining dense and sparse approaches, notably ColBERT-X with PLAID and an mT5-based reranker. It compares Translate-Train, Translate-Distill, and Multilingual Translate-Train (MTT) across CLIR, MLIR, and Technical Documents, highlighting distillation as a key driver of strong, efficient performance and demonstrating the benefits and trade-offs of multilingual training and index compression. Key findings show that distillation from a high-capacity mT5 reranker yields competitive or superior results with lighter models, while a single multilingual ColBERT-X can handle multiple languages effectively when trained with MTT; domain differences (news vs technical) influence performance patterns and the impact of date metadata. The work provides practical insights into when dense reranking, lexical methods, or multilingual training are advantageous and underscores the value of date-aware topic handling for robust MLIR submissions.

Abstract

The HLTCOE team applied PLAID, an mT5 reranker, and document translation to the TREC 2023 NeuCLIR track. For PLAID we included a variety of models and training techniques: the English model released with ColBERT v2, translate-train (TT), translate-distill (TD), and multilingual translate-train (MTT). TT trains a ColBERT model with English queries and passages automatically translated into the document language from the MS-MARCO v1 collection. This results in three cross-language models for the track, one per language. MTT creates a single model for all three document languages by combining the translations of MS-MARCO passages in all three languages into mixed-language batches. Thus the model learns about matching queries to passages simultaneously in all languages. Distillation uses scores from the mT5 model over non-English translated document pairs to learn how to score query-document pairs. The team submitted runs to all NeuCLIR tasks: the CLIR and MLIR news tasks as well as the technical documents task.
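To make the mixed-language batching behind MTT concrete, the following is a minimal sketch and not the team's training code. It assumes MS-MARCO style training triples of (query id, positive passage id, negative passage id), English queries, and one translated passage table per document language. The function name `mixed_language_batches`, the `Example` type, the language codes in `LANGS`, and the round-robin language assignment are all illustrative assumptions; the abstract states only that translations in all three languages are combined into mixed-language batches.

```python
import random
from typing import Dict, List, NamedTuple, Tuple

# Assumed codes for the three document languages (illustrative only).
LANGS = ("zh", "fa", "ru")


class Example(NamedTuple):
    query: str     # English query text
    positive: str  # translated relevant passage
    negative: str  # translated non-relevant passage
    lang: str      # language the two passages were translated into


def mixed_language_batches(
    queries: Dict[str, str],
    triples: List[Tuple[str, str, str]],
    translated: Dict[str, Dict[str, str]],
    batch_size: int,
    seed: int = 0,
) -> List[List[Example]]:
    """Build training batches whose passages span several languages.

    Hypothetical helper: triples are shuffled, then languages are assigned
    round-robin, so any batch with at least len(LANGS) examples contains
    passages in every language.
    """
    rng = random.Random(seed)
    order = list(triples)
    rng.shuffle(order)

    examples = []
    for i, (qid, pos_pid, neg_pid) in enumerate(order):
        lang = LANGS[i % len(LANGS)]
        passages = translated[lang]
        examples.append(
            Example(queries[qid], passages[pos_pid], passages[neg_pid], lang)
        )

    # Chunk the language-interleaved examples into fixed-size batches.
    return [
        examples[start:start + batch_size]
        for start in range(0, len(examples), batch_size)
    ]
```

By contrast, TT as described above would feed each of the three per-language models only the examples for its own language. Other mixing schemes, such as sampling a language at random per example, fit the same description equally well.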

Paper Structure (13 sections, 6 tables)