Weighted KL-Divergence for Document Ranking Model Refinement

Yingrui Yang, Yifan Qiao, Shanxiu He, Tao Yang

TL;DR

This work improves knowledge-distillation-based ranking by introducing a contrastively weighted KL-divergence (CKL) loss that prioritizes separating positive documents from negative ones. CKL reweights KL terms by $(1-q_j)^{\gamma}$ for positives and $(q_i)^{\gamma-\beta_i}$ for negatives, where $\beta_i$ is a rank-based bias, so that the student follows the teacher more closely when the teacher performs well and less closely when it does not. The approach yields state-of-the-art or competitive results on MS MARCO and BEIR for both two-stage retrieval pipelines and dense retrievers. Because CKL only modifies the loss, it transfers to sparse and dense retrieval settings with minimal architectural changes.

Abstract

Transformer-based retrieval and reranking models for text document search are often refined through knowledge distillation together with contrastive learning. Tightly matching the student's distribution to the teacher's can be difficult, because over-calibration may degrade training effectiveness when the teacher does not perform well. This paper contrastively reweights KL divergence terms to prioritize the alignment between a student and a teacher model for proper separation of positive and negative documents. It analyzes and evaluates the proposed loss function on the MS MARCO and BEIR datasets to demonstrate its effectiveness in improving the relevance of the tested student models.
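
To make the reweighting concrete, the following is a minimal sketch rather than the authors' implementation. It assumes that $p$ and $q$ are the teacher's and student's softmax distributions over one query's candidate documents, and that each weight multiplies the corresponding KL term $p_k \log(p_k / q_k)$. The function names `softmax` and `weighted_kl` are hypothetical. The summary does not define the rank-based bias $\beta_i$, so the sketch takes it as a caller-supplied input.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of raw scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_kl(teacher_scores, student_scores, is_positive, gamma, betas):
    """Weighted KL sketch: sum_k w_k * p_k * log(p_k / q_k).

    Assumptions (not confirmed by the source summary):
      - p is the teacher's softmax distribution, q the student's.
      - Positives are weighted by (1 - q_k) ** gamma.
      - Negatives are weighted by q_k ** (gamma - beta_k).
      - betas holds one caller-supplied bias per document; entries at
        positive positions are ignored.
    """
    p = softmax(teacher_scores)
    q = softmax(student_scores)
    loss = 0.0
    for p_k, q_k, pos, beta_k in zip(p, q, is_positive, betas):
        if pos:
            weight = (1.0 - q_k) ** gamma
        else:
            weight = q_k ** (gamma - beta_k)
        loss += weight * p_k * math.log(p_k / q_k)
    return loss


# One query with one positive and two negative candidates (made-up scores).
teacher = [2.0, 0.5, -1.0]
student = [1.0, 1.0, 0.0]
labels = [True, False, False]
example_loss = weighted_kl(teacher, student, labels, gamma=1.0, betas=[0.0, 0.2, 0.4])
```

With `gamma=0.0` and all biases set to zero, every weight equals 1 and the function reduces to the ordinary KL divergence between the teacher and student distributions, which is a convenient sanity check for the sketch.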

Paper Structure (7 sections, 5 equations, 3 figures, 2 tables)

Figures (3)

  • Figure 1: The weights of CKL terms, sorted in descending order of the student's predictions
  • Figure 2: Relative gradient contribution ratio $g$ of CKL (blue triangles) and BKL (red bullets)
  • Figure 3: Behavior characteristics of CKL/KL during training