
Leveraging Estimated Transferability Over Human Intuition for Model Selection in Text Ranking

Jun Bai, Zhuofan Chen, Zhenzi Li, Hanhua Hong, Jianfei Zhang, Chen Li, Chenghua Lin, Wenge Rong

TL;DR

The method computes the expected rank as transferability, which explicitly reflects the model’s ranking capability. To mitigate anisotropy and incorporate training dynamics, it adaptively scales isotropic sentence embeddings to yield an accurate expected rank score.

Abstract

Text ranking has advanced significantly through dual-encoders enhanced by Pre-trained Language Models (PLMs). Given the proliferation of available PLMs, selecting the most effective one for a given dataset has become a non-trivial challenge. As a promising alternative to human intuition and brute-force fine-tuning, Transferability Estimation (TE) has emerged as an effective approach to model selection. However, current TE methods are primarily designed for classification tasks, and their estimated transferability may not align well with the objectives of text ranking. To address this challenge, we propose to compute the expected rank as transferability, explicitly reflecting the model's ranking capability. Furthermore, to mitigate anisotropy and incorporate training dynamics, we adaptively scale isotropic sentence embeddings to yield an accurate expected rank score. Our resulting method, Adaptive Ranking Transferability (AiRTran), can effectively capture subtle differences between models. In challenging model selection scenarios across various text ranking datasets, it demonstrates significant improvements over previous classification-oriented TE methods, human intuition, and ChatGPT, at a minor time cost.
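
The abstract names two ingredients: making sentence embeddings isotropic, and scoring a model by the expected rank of the relevant document. The sketch below illustrates that idea under loud assumptions and is not the authors' implementation. The soft rank (a sum of sigmoids over similarity gaps), the `temperature` parameter, the optional per-dimension `scale` vector standing in for adaptive scaling, and all function names are hypothetical choices made here for illustration; the abstract does not give AiRTran's actual formulas.

```python
import numpy as np


def whiten(embeddings: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Center the embeddings and transform them so their covariance is ~identity."""
    centered = embeddings - embeddings.mean(axis=0, keepdims=True)
    cov = centered.T @ centered / centered.shape[0]
    eigvals, eigvecs = np.linalg.eigh(cov)
    eigvals = np.clip(eigvals, 0.0, None)  # guard against tiny negative values
    return (centered @ eigvecs) / np.sqrt(eigvals + eps)


def soft_expected_rank(queries, docs, positive_idx, scale=None, temperature=1.0) -> float:
    """Mean soft rank of each query's relevant document (1 is best).

    ASSUMPTION: rank is approximated as 1 + sum over other documents of
    sigmoid((sim_other - sim_positive) / temperature). This is an
    illustrative surrogate, not the formula from the paper.
    """
    if scale is not None:
        # Hypothetical stand-in for adaptive scaling: per-dimension weights.
        queries = queries * scale
    sims = queries @ docs.T  # shape (n_queries, n_docs)
    pos = sims[np.arange(len(queries)), positive_idx][:, None]
    # Numerically stable sigmoid written with tanh.
    soft_wins = 0.5 * (1.0 + np.tanh((sims - pos) / (2.0 * temperature)))
    # The positive document contributes sigmoid(0) = 0.5 to the sum,
    # so 1 + (sum - 0.5) simplifies to 0.5 + sum.
    ranks = 0.5 + soft_wins.sum(axis=1)
    return float(ranks.mean())


def transferability(queries, docs, positive_idx, scale=None) -> float:
    """Higher is better: negative soft expected rank on jointly whitened embeddings."""
    n_q = len(queries)
    white = whiten(np.vstack([queries, docs]))
    return -soft_expected_rank(white[:n_q], white[n_q:], positive_idx, scale)
```

In this sketch, each candidate model would encode the same queries and documents, and the model with the highest `transferability` value would be selected. Queries and documents are whitened jointly so that both live in the same transformed space.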

Paper Structure (42 sections, 12 equations, 7 figures, 9 tables)

Figures (7)

  • Figure 1: The pipeline of model selection in text ranking using AiRTran. First, the queries and documents are encoded into sentence embeddings by each candidate model $\phi$. Then, the raw embeddings are transformed by whitening followed by adaptive scaling. Finally, the transformed embeddings, together with the labels, are used to compute the expected rank as transferability, from which the best-performing model is selected.
  • Figure 2: The performance variations of AiRTran over different document sizes.
  • Figure 3: Comparison of the time consumption of all methods as the number of candidate documents grows. Note that the encoding time for the dataset is not included, since it is shared by all methods.
  • Figure 4: The comparison of Kendall’s $\tau$ between AiRTran, human intuition, and ChatGPT.
  • Figure 5: The predictions of AiRTran against the fine-tuning results with the best Kendall's $\tau$ performance.
  • ...and 2 more figures
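
Figures 4 and 5 report Kendall's $\tau$, which measures how well the ordering of estimated transferability scores agrees with the ordering of fine-tuning results across candidate models. A minimal sketch of the plain (tau-a) statistic follows. The function name is hypothetical, and the paper may use a different variant (for example a weighted or tie-corrected one), which the captions do not specify.

```python
from itertools import combinations


def kendall_tau(predicted, actual):
    """Kendall's tau-a: (concordant - discordant) / total number of pairs.

    Tied pairs count as neither concordant nor discordant.
    """
    n = len(predicted)
    if n != len(actual) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    concordant = discordant = 0
    for i, j in combinations(range(n), 2):
        # Positive product: both sequences order the pair (i, j) the same way.
        agreement = (predicted[i] - predicted[j]) * (actual[i] - actual[j])
        if agreement > 0:
            concordant += 1
        elif agreement < 0:
            discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)
```

A value of 1 means the estimated scores rank the candidate models exactly as fine-tuning does, and -1 means the ranking is fully reversed.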