Contrastive Retrieval Heads Improve Attention-Based Re-Ranking

Linh Tran, Yulong Li, Radu Florian, Wei Sun

TL;DR

The paper tackles zero-shot, long-context document re-ranking by identifying a small, high-signal subset of attention heads—CoRe heads—via a contrastive scoring metric that rewards attention to relevant documents while down-weighting attention to irrelevant ones. By aggregating signals from these CoRe heads, the authors achieve state-of-the-art list-wise re-ranking across BEIR and cross-lingual MLDR benchmarks, outperforming both full-head aggregation (ICR) and QR-head baselines. A key finding is that top CoRe heads concentrate in middle transformer layers, enabling effective layer pruning that reduces memory usage and latency without sacrificing accuracy. The approach demonstrates strong cross-model generalization, robustness across datasets, and practical efficiency gains suitable for production-ready retrieval systems.

Abstract

The strong zero-shot and long-context capabilities of recent Large Language Models (LLMs) have paved the way for highly effective re-ranking systems. Attention-based re-rankers leverage attention weights from transformer heads to produce relevance scores, but not all heads are created equal: many contribute noise and redundancy, thus limiting performance. To address this, we introduce CoRe heads, a small set of retrieval heads identified via a contrastive scoring metric that explicitly rewards heads whose high attention correlates with relevant documents, while downplaying heads whose high attention correlates with irrelevant documents. This relative ranking criterion isolates the most discriminative heads for re-ranking and yields a state-of-the-art list-wise re-ranker. Extensive experiments with three LLMs show that aggregated signals from CoRe heads, constituting less than 1% of all heads, substantially improve re-ranking accuracy over strong baselines. We further find that CoRe heads are concentrated in middle layers, and pruning the computation of the final 50% of model layers preserves accuracy while significantly reducing inference time and memory usage.
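
The contrastive selection described in the abstract can be illustrated with a minimal sketch. The abstract does not state the scoring formula, so the InfoNCE-style log-softmax used here, the function names (`contrastive_head_score`, `select_top_heads`, `rerank`), the `(layer, head)` identifiers, and the toy attention values are all assumptions made for illustration, not the authors' implementation. The temperature `t` mirrors the temperature mentioned in the Figure 4 caption.

```python
import math

def contrastive_head_score(pos_attn, neg_attns, t=0.1):
    """InfoNCE-style score for one head (an assumed form, not the paper's exact metric).

    pos_attn:  attention mass the head places on the relevant document.
    neg_attns: attention masses the head places on irrelevant documents.
    t:         temperature; smaller values sharpen the contrast.
    """
    logits = [pos_attn / t] + [a / t for a in neg_attns]
    m = max(logits)  # subtract the max for numerical stability
    log_denom = m + math.log(sum(math.exp(x - m) for x in logits))
    return logits[0] - log_denom  # log-softmax of the positive, always <= 0

def select_top_heads(head_scores, k):
    """Return the k head ids with the highest contrastive score."""
    return sorted(head_scores, key=head_scores.get, reverse=True)[:k]

def rerank(doc_attn_by_head, heads):
    """Sum per-document attention over the chosen heads, then sort documents."""
    n_docs = len(next(iter(doc_attn_by_head.values())))
    totals = [sum(doc_attn_by_head[h][d] for h in heads) for d in range(n_docs)]
    return sorted(range(n_docs), key=lambda d: totals[d], reverse=True)

# Toy data (hypothetical): head id -> attention per document.
# Document 0 is the relevant one.
attn = {
    (14, 3): [0.70, 0.10, 0.20],  # attends mostly to the relevant document
    (15, 7): [0.60, 0.30, 0.10],
    (2, 1):  [0.20, 0.50, 0.30],  # high attention, but on an irrelevant document
}
scores = {h: contrastive_head_score(a[0], a[1:], t=0.1) for h, a in attn.items()}
top = select_top_heads(scores, k=2)
ranking = rerank(attn, top)
print(top)      # [(14, 3), (15, 7)]
print(ranking)  # [0, 1, 2]
```

In this toy setup the head that attends strongly to an irrelevant document receives a much lower score and is excluded, so only the two discriminative heads contribute to the final ranking. A real system would average the per-head scores over many labeled queries before selecting heads.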

Paper Structure

This paper contains 31 sections, 7 equations, 11 figures, 5 tables.

Figures (11)

  • Figure 1: nDCG@10 on Quora top-40. The re-ranker with the top $8$ QR heads (QR-R) underperforms ICR, which uses all heads, while the re-ranker with the top $8$ CoRe heads (CoRe-R) outperforms both ICR and QR-R.
  • Figure 2: nDCG@10 on DBPedia top-40 with CoRe-R for Mistral 7B. Aggregating the attention signal from fewer heads yields a higher score. The re-ranking score peaks with the top $9$ CoRe heads.
  • Figure 3: Average nDCG@10 on the BEIR benchmark using the attention signal from different numbers of top retrieval heads.
  • Figure 4: Distribution of $S_{CoRe}$ for all heads in each model. We choose the temperature $t=0.001$ for Mistral 7B and $t=0.1$ for Llama-3.1 8B and Phi-4. The top CoRe heads are concentrated mostly in the middle layers of every model.
  • Figure 5: Average latency and re-ranking accuracy on the BEIR benchmark. Pruning $50\%$ of the model's layers does not affect re-ranking performance while saving $20\%$ of inference time.
  • ...and 6 more figures
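
The layer-pruning result in Figures 4 and 5 rests on how a transformer is evaluated: a head's attention weights at a given layer depend only on that layer and the ones below it, so layers above the deepest selected head are not needed for attention-based scoring. The sketch below is a minimal illustration under stated assumptions: the helper name `layers_needed`, the 0-based `(layer, head)` pairs, and the 32-layer model are hypothetical and not taken from the paper.

```python
def layers_needed(selected_heads, num_layers):
    """Return (layers to run, fraction of layers skipped).

    selected_heads: iterable of (layer, head) pairs with 0-based layer indices.
    num_layers:     total number of transformer layers in the model.
    """
    deepest = max(layer for layer, _head in selected_heads)
    keep = deepest + 1  # run layers 0..deepest inclusive
    if keep > num_layers:
        raise ValueError("a selected head lies beyond the last layer")
    return keep, (num_layers - keep) / num_layers

# Hypothetical head set concentrated in the middle of a 32-layer model.
heads = [(12, 5), (13, 0), (14, 3), (15, 7)]
keep, skipped = layers_needed(heads, num_layers=32)
print(keep, skipped)  # 16 0.5
```

With these hypothetical heads, only the first 16 layers have to be computed, which corresponds to skipping the final 50% of the model.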