Translate-Distill: Learning Cross-Language Dense Retrieval by Translation and Distillation

Eugene Yang, Dawn Lawrie, James Mayfield, Douglas W. Oard, Scott Miller

TL;DR

This work introduces Translate-Distill, a training pipeline that combines machine translation with knowledge distillation from cross-encoder teachers to train efficient dual-encoders for Cross-Language Information Retrieval (CLIR). By decoupling the input languages of the teacher, passage selection, and student components, and by precomputing translations and teacher scores, the method trains ColBERT-X dual-encoders that approach or exceed Translate-Train performance while remaining computationally efficient at inference. Empirical results on two CLIR benchmarks (NeuCLIR 2022 and HC3) show state-of-the-art end-to-end CLIR performance when using a Mono-mT5XXL or similar teacher scorer, with substantial gains over baselines and favorable comparisons to Retrieve-and-Rerank pipelines. For multilingual search systems, the approach delivers strong CLIR effectiveness without the heavy translation costs typical of end-to-end translation-based pipelines, and it opens avenues for distilling from even larger multilingual models in the future.

Abstract

Prior work on English monolingual retrieval has shown that a cross-encoder trained using a large number of relevance judgments for query-document pairs can be used as a teacher to train more efficient, but similarly effective, dual-encoder student models. Applying a similar knowledge distillation approach to training an efficient dual-encoder model for Cross-Language Information Retrieval (CLIR), where queries and documents are in different languages, is challenging due to the lack of a sufficiently large training collection when the query and document languages differ. The state of the art for CLIR thus relies on translating queries, documents, or both from the large English MS MARCO training set, an approach called Translate-Train. This paper proposes an alternative, Translate-Distill, in which knowledge distillation from either a monolingual cross-encoder or a CLIR cross-encoder is used to train a dual-encoder CLIR student model. This richer design space enables the teacher model to perform inference in an optimized setting, while training the student model directly for CLIR. Trained models and artifacts are publicly available on Hugging Face.

Paper Structure (19 sections, 2 equations, 1 figure, 5 tables)

Figures (1)

  • Figure 1: Translate-Distill training pipeline. The white boxes with blue borders are the fixed teacher models. The hatched green box is the trainable student model. Dashed arrows indicate the optional machine translation middle step, i.e., the input text the model receives can either be original or translated, with different translation decisions made for each dashed arrow.
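
The distillation step that connects the fixed teacher to the trainable student in Figure 1 can be illustrated with a minimal sketch. It assumes the student is trained to match the teacher's score distribution over a set of candidate passages for one query, using a KL-divergence loss; that loss choice, the function names, and the score values are illustrative assumptions and are not taken from the text above.

```python
import math


def log_softmax(scores):
    """Numerically stable log-softmax over a list of raw scores."""
    m = max(scores)
    log_sum = m + math.log(sum(math.exp(s - m) for s in scores))
    return [s - log_sum for s in scores]


def kl_distillation_loss(teacher_scores, student_scores):
    """KL(teacher || student) over the candidate passages of one query.

    Assumption: both lists score the same passages in the same order.
    Teacher scores would be precomputed by the cross-encoder; student
    scores would come from the dual-encoder being trained.
    """
    log_p = log_softmax(teacher_scores)
    log_q = log_softmax(student_scores)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))


# Hypothetical precomputed teacher scores for four candidate passages.
teacher = [9.1, 4.2, 1.0, -2.3]
# Hypothetical student scores: one ranking agrees with the teacher, one reverses it.
student_good = [3.0, 1.2, 0.1, -1.0]
student_bad = [-1.0, 0.1, 1.2, 3.0]

print(kl_distillation_loss(teacher, student_good))
print(kl_distillation_loss(teacher, student_bad))
```

Under these assumptions the loss is zero when the student reproduces the teacher's distribution and grows as the student's ranking diverges from it. Because the teacher scores are plain numbers, they can be computed once and stored, independently of the language in which the student later reads the query and passages.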