Adam: Dense Retrieval Distillation with Adaptive Dark Examples

Chongyang Tao, Chang Liu, Tao Shen, Can Xu, Xiubo Geng, Binxing Jiao, Daxin Jiang

TL;DR

This work improves knowledge distillation for dense retrieval by exploiting the cross-encoder's dark knowledge through Adaptive Dark Examples (ADAM). The method constructs dark examples of moderate relevance from reinforced negatives and noisy positives, then applies a self-paced, confidence-driven distillation strategy. Together these smooth the teacher's output distribution and improve knowledge transfer to the dual-encoder. Experiments on MS-MARCO and TREC DL 2019 show substantial improvements over strong baselines, and ablations confirm that both the dark examples and the adaptive data selection are necessary. The gains hold across multiple cross-encoder teachers, and the approach shows good zero-shot transfer, indicating practical value for scalable IR systems.

Abstract

To improve the performance of the dual-encoder retriever, one effective approach is knowledge distillation from the cross-encoder ranker. Existing works construct the candidate passages following the supervised learning setting, where a query is paired with a positive passage and a batch of negatives. However, through empirical observation, we find that even the hard negatives from advanced methods are still too trivial for the teacher to distinguish, preventing the teacher from transferring abundant dark knowledge to the student through its soft label. To alleviate this issue, we propose ADAM, a knowledge distillation framework that can better transfer the dark knowledge held in the teacher with Adaptive Dark exAMples. Unlike previous works that rely only on one positive and hard negatives as candidate passages, we create dark examples that all have moderate relevance to the query through mixing-up and masking in discrete space. Furthermore, because the quality of knowledge held in different training instances varies, as measured by the teacher's confidence score, we propose a self-paced distillation strategy that adaptively concentrates on a subset of high-quality instances when conducting our dark-example-based knowledge distillation, helping the student learn better. We conduct experiments on two widely used benchmarks and verify the effectiveness of our method.
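
The abstract names three ingredients: masking, mixing-up in discrete space, and confidence-based instance selection. The sketch below illustrates one plausible reading of each and is not the authors' implementation. The function names, the `[MASK]` token, the prefix/suffix splicing rule, the ratios, and the assumption that higher teacher confidence marks a higher-quality instance are all hypothetical choices made for illustration.

```python
import random
from typing import List, Sequence

MASK = "[MASK]"  # hypothetical mask token, not taken from the paper


def noisy_positive(pos_tokens: Sequence[str], mask_ratio: float,
                   rng: random.Random) -> List[str]:
    """Masking in discrete space: replace a fraction of the positive
    passage's tokens with a mask token to weaken its relevance."""
    n_mask = int(len(pos_tokens) * mask_ratio)
    masked = set(rng.sample(range(len(pos_tokens)), n_mask))
    return [MASK if i in masked else tok for i, tok in enumerate(pos_tokens)]


def mixed_sample(pos_tokens: Sequence[str], neg_tokens: Sequence[str],
                 pos_ratio: float) -> List[str]:
    """Discrete mix-up (assumed rule): join a prefix of the positive
    passage with a suffix of a negative passage."""
    n_pos = int(len(pos_tokens) * pos_ratio)
    n_neg = len(neg_tokens) - int(len(neg_tokens) * pos_ratio)
    return list(pos_tokens[:n_pos]) + list(neg_tokens[len(neg_tokens) - n_neg:])


def select_confident(teacher_conf: Sequence[float],
                     keep_fraction: float) -> List[int]:
    """Self-paced selection (assumed rule): keep the indices of the
    instances with the highest teacher confidence."""
    k = max(1, int(len(teacher_conf) * keep_fraction))
    ranked = sorted(range(len(teacher_conf)),
                    key=lambda i: teacher_conf[i], reverse=True)
    return sorted(ranked[:k])


if __name__ == "__main__":
    rng = random.Random(0)
    pos = "a b c d e f g h".split()
    neg = "s t u v w x y z".split()
    print(noisy_positive(pos, 0.25, rng))
    print(mixed_sample(pos, neg, 0.5))
    print(select_confident([0.9, 0.2, 0.7, 0.4], 0.5))
```

In a self-paced schedule, `keep_fraction` would typically change over the course of training so that the selected subset adapts; how the paper schedules this is not stated in the abstract.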

Paper Structure

This paper contains 23 sections, 10 equations, 3 figures, and 3 tables.

Figures (3)

  • Figure 1: Distributions of the predictions of the R$^2$anker cross-encoder (Zhou et al., 2022) over MS-MARCO. POS and NEG denote the distributions for positives and hard negatives, respectively. The hard negatives are provided by RocketQAv2 (Ren et al., 2021).
  • Figure 2: Illustration of dark examples. The solid rectangle and the triangles denote the gold passage and the negative passages, respectively. Dotted rectangles and circles denote noisy positives and mixed samples, respectively.
  • Figure 3: (a) The impact of $m$; (b) distributions of the model predictions of R$^2$anker over MS-MARCO.