Shallow Cross-Encoders for Low-Latency Retrieval

Aleksandr V. Petrov, Sean MacAvaney, Craig Macdonald

TL;DR

This work tackles the latency bottleneck of Cross-Encoders in text retrieval by showing that shallow Cross-Encoders can achieve higher effectiveness than full-scale models under strict latency constraints, because they can score more candidates within the same time budget. A direct training method using generalized Binary Cross-Entropy (gBCE) without knowledge distillation, combined with a high number of sampled negatives, stabilizes these shallow models and mitigates overconfidence. Experiments with MS MARCO-derived training data and the TREC DL 2019 and 2020 query sets show large gains at latencies under 50 ms (e.g., at a 25 ms limit on TREC DL 2019, TinyBERT-gBCE achieves NDCG@10 ≈ 0.652 versus ≈ 0.431 for MonoBERT-Large). CPU-only inference remains viable: at a 50 ms latency limit, NDCG@10 drops by only about 3% relative to GPU inference. The findings suggest shallow Cross-Encoders offer practical, energy-efficient, and scalable solutions for low-latency retrieval systems, with opportunities for further engineering optimization.

Abstract

Transformer-based Cross-Encoders achieve state-of-the-art effectiveness in text retrieval. However, Cross-Encoders based on large transformer models (such as BERT or T5) are computationally expensive and allow for scoring only a small number of documents within a reasonably small latency window. At the same time, keeping search latencies low is important for user satisfaction and energy usage. In this paper, we show that weaker shallow transformer models (i.e., transformers with a limited number of layers) actually perform better than full-scale models when constrained to these practical low-latency settings, since they can estimate the relevance of more documents in the same time budget. We further show that shallow transformers may benefit from the generalized Binary Cross-Entropy (gBCE) training scheme, which has recently demonstrated success for recommendation tasks. Our experiments with TREC Deep Learning passage ranking query sets demonstrate significant improvements in shallow and full-scale models in low-latency scenarios. For example, when the latency limit is 25 ms per query, MonoBERT-Large (a cross-encoder based on a full-scale BERT model) is only able to achieve NDCG@10 of 0.431 on TREC DL 2019, while TinyBERT-gBCE (a cross-encoder based on TinyBERT trained with gBCE) reaches NDCG@10 of 0.652, a +51% gain over MonoBERT-Large. We also show that shallow Cross-Encoders are effective even when used without a GPU (e.g., with CPU inference, NDCG@10 decreases only by 3% compared to GPU inference with 50 ms latency), which makes Cross-Encoders practical to run even without specialized hardware acceleration.
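
The time-budget argument can be illustrated with a small sketch. It is not the authors' code: the function names and the per-document timings and overhead used in the usage note are hypothetical placeholders chosen only to show the arithmetic. Only the two NDCG@10 values (0.431 and 0.652) come from the abstract.

```python
def candidates_within_budget(budget_ms: float, overhead_ms: float, per_doc_ms: float) -> int:
    """Return how many candidates a re-ranker can score inside a latency budget.

    budget_ms   -- total latency allowed per query
    overhead_ms -- fixed cost outside scoring (e.g. first-stage retrieval); hypothetical
    per_doc_ms  -- average scoring time per candidate for a given model; hypothetical
    """
    if per_doc_ms <= 0:
        raise ValueError("per_doc_ms must be positive")
    remaining = budget_ms - overhead_ms
    if remaining <= 0:
        return 0
    # Only whole documents can be scored, so round down.
    return int(remaining // per_doc_ms)


def relative_gain(new: float, baseline: float) -> float:
    """Relative improvement of `new` over `baseline` (0.5 means +50%)."""
    return (new - baseline) / baseline


# NDCG@10 values quoted in the abstract for a 25 ms limit on TREC DL 2019.
gain = relative_gain(0.652, 0.431)
print(f"relative gain: {gain:+.0%}")
```

With a 25 ms budget, a hypothetical 5 ms of fixed overhead, and hypothetical scoring costs of 4 ms versus 0.25 ms per candidate, `candidates_within_budget` gives 5 candidates for the slower model and 80 for the faster one. This is the mechanism the abstract describes: a weaker but faster model sees more of the candidate list in the same time.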

Paper Structure (10 sections, 6 equations, 5 figures, 3 tables)

Figures (5)

  • Figure 1: Latency/NDCG tradeoffs on the TREC-DL2020 queryset when varying the number of retrieved candidates $K$.
  • Figure 2: A typical BERT-based Cross-Encoder. Note that the structure can be adapted to other transformers with slight modification.
  • Figure 3: Latency/NDCG tradeoffs of experimental models when varying the number of candidates from BM25 between 1 and 1000. The shaded area represents the low-latency zone (latency less than 50ms).
  • Figure 4: Predicted probabilities at different ranks for TREC-DL2019 query 146187 "difference between a mcdouble and a double cheeseburger".
  • Figure 5: Comparison of tradeoffs on CPU and GPU, TREC-DL2020 queryset.