How to Make Cross Encoder a Good Teacher for Efficient Image-Text Retrieval?

Yuxin Chen, Zongyang Ma, Ziqi Zhang, Zhongang Qi, Chunfeng Yuan, Bing Li, Junfu Pu, Ying Shan, Xiaojuan Qi, Weiming Hu

TL;DR

This work tackles the challenge of achieving high-accuracy image-text retrieval with efficient dual-encoders by distilling knowledge from cross-encoders. It finds that traditional logit-based distillation struggles because the score distributions of the two model types do not match, while the ranking of hard negatives carries the key transferable information. The authors introduce Contrastive Partial Ranking Distillation (CPRD), which mines hard negatives with a dual-encoder, constructs a partial ranking target from a cross-encoder, and applies a contrastive learning objective to align the dual-encoder’s ranking of valid hard negatives with that of the cross-encoder. In experiments on MSCOCO, Flickr30K, and ranking benchmarks, CPRD consistently improves dual-encoder performance, surpassing prior distillation methods and achieving results competitive with cross-encoder approaches while maintaining retrieval efficiency. The method offers a practical, scalable pathway to high-accuracy, efficient image-text retrieval in real-world systems.
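The three steps named above (mine hard negatives with the student, take a partial ranking from the teacher, train with a contrastive objective) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the `valid_threshold` rule for deciding which negatives count as valid hard negatives, and the exact loss form (each valid hard negative is contrasted against every candidate the teacher ranks below it) are all hypothetical choices made for this example.

```python
import math


def mine_hard_negatives(student_scores, k):
    """Return indices of the k negatives the student (dual-encoder) scores highest."""
    order = sorted(range(len(student_scores)),
                   key=lambda i: student_scores[i], reverse=True)
    return order[:k]


def partial_ranking_contrastive_loss(student_scores, teacher_scores,
                                     valid_threshold, temperature=0.1):
    """Contrastive loss over a partial ranking given by the teacher.

    Assumption for this sketch: a negative is a "valid hard negative" when its
    teacher (cross-encoder) score is at least `valid_threshold`; all others are
    treated as easy negatives whose mutual order is ignored.
    """
    # Teacher order, highest score first.
    order = sorted(range(len(teacher_scores)),
                   key=lambda i: teacher_scores[i], reverse=True)
    valid = [i for i in order if teacher_scores[i] >= valid_threshold]
    easy = [i for i in order if teacher_scores[i] < valid_threshold]

    total, terms = 0.0, 0
    for pos, i in enumerate(valid):
        # Everything the teacher ranks below candidate i, easy negatives included.
        lower = valid[pos + 1:] + easy
        if not lower:
            continue
        logits = [student_scores[i] / temperature]
        logits += [student_scores[j] / temperature for j in lower]
        # Numerically stable log-sum-exp.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(x - m) for x in logits))
        # InfoNCE-style term: candidate i should outscore everything below it.
        total += log_denom - logits[0]
        terms += 1
    return total / terms if terms else 0.0


# Hypothetical scores for six negatives of one query.
student = [0.9, 0.7, 0.5, 0.1, 0.2, -0.3]
teacher = [5.0, 3.0, 1.0, -4.0, -5.0, -9.0]

hard = mine_hard_negatives(student, k=5)
loss = partial_ranking_contrastive_loss(
    [student[i] for i in hard], [teacher[i] for i in hard], valid_threshold=0.0)
```

Because easy negatives only ever appear together in the denominator, swapping their order changes nothing in this loss, which mirrors the idea that only the order among hard negatives is worth transferring.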

Abstract

Dominant dual-encoder models enable efficient image-text retrieval but suffer from limited accuracy, while cross-encoder models offer higher accuracy at the expense of efficiency. Distilling cross-modality matching knowledge from a cross-encoder to a dual-encoder provides a natural approach to harness their strengths. Thus, we investigate the following question: how to make a cross-encoder a good teacher for a dual-encoder? Our findings are threefold: (1) The cross-modal similarity score distribution of the cross-encoder is more concentrated, while that of the dual-encoder is nearly normal, making vanilla logit distillation less effective. However, ranking distillation remains practical, as it is not affected by the score distribution. (2) Only the relative order between hard negatives conveys valid knowledge, while the order information between easy negatives has little significance. (3) Maintaining the coordination between the distillation loss and the dual-encoder training loss is beneficial for knowledge transfer. Based on these findings, we propose a novel Contrastive Partial Ranking Distillation (CPRD) method, which implements the objective of mimicking the relative order between hard negative samples with contrastive learning. This approach coordinates with the training of the dual-encoder, effectively transferring valid knowledge from the cross-encoder to the dual-encoder. Extensive experiments on image-text retrieval and ranking tasks show that our method surpasses other distillation methods and significantly improves the accuracy of the dual-encoder.
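Finding (1) can be illustrated with a toy example. The numbers below are made up for illustration and are not taken from the paper; they assume one way the mismatch might show up, with teacher scores so concentrated that their softmax is nearly one-hot while the student's similarities stay spread out. The sketch also checks that a rank order survives any strictly increasing rescaling of the scores, which is why a ranking target does not depend on the shape of the score distribution.

```python
import math


def softmax(scores, temperature=1.0):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp((s - m) / temperature) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]


def ranking(scores):
    """Indices sorted from highest to lowest score."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


# Hypothetical scores for one query: index 0 is the positive, the rest are negatives.
teacher_logits = [12.0, 2.0, 1.0, -6.0]   # cross-encoder, concentrated
student_sims = [0.62, 0.55, 0.58, 0.20]   # dual-encoder, spread out

# Softmax targets: the teacher's is almost one-hot, so a KL-based target says
# very little about how the negatives relate to each other.
teacher_probs = softmax(teacher_logits)
student_probs = softmax(student_sims)

# Rank order is unchanged by a strictly increasing transform of the scores.
rescaled_teacher = [math.tanh(x / 20.0) for x in teacher_logits]
same_order = ranking(rescaled_teacher) == ranking(teacher_logits)
```

In this toy case the teacher's order of the negatives is still fully readable from `ranking(teacher_logits)`, even though almost all of its softmax mass sits on the positive.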

Paper Structure

This paper contains 20 sections, 18 equations, 5 figures, and 11 tables.

Figures (5)

  • Figure 1: (a) Similarity score distribution of dual-encoder and cross-encoder. (b) Student predictions and targets for different types of distillation methods. For partial ranking distillation, the relative order between easy negatives is disregarded.
  • Figure 2: The illustration of the Contrastive Partial Ranking Distillation method. The left shows the overall training process and the right side elaborates on the computation process of image-text alignment and contrastive partial ranking distillation.
  • Figure 3: (a) KL-divergence-based distillation targets from cross-encoder. (b) Predicted similarity scores from student dual-encoder after softmax operation.
  • Figure 4: Illustration of image-to-text retrieval results of our model and baseline model. Ground-truth captions for each image are in red.
  • Figure 5: Illustration of text-to-image retrieval results of our model and baseline model. The ground-truth image for each text is in the red box.