Contrastive Retrieval Heads Improve Attention-Based Re-Ranking
Linh Tran, Yulong Li, Radu Florian, Wei Sun
TL;DR
The paper tackles zero-shot, long-context document re-ranking by identifying a small, high-signal subset of attention heads—CoRe heads—via a contrastive scoring metric that rewards attention to relevant documents while down-weighting attention to irrelevant ones. By aggregating signals from these CoRe heads, the authors achieve state-of-the-art list-wise re-ranking across BEIR and cross-lingual MLDR benchmarks, outperforming both full-head aggregation (ICR) and QR-head baselines. A key finding is that top CoRe heads concentrate in middle transformer layers, enabling effective layer pruning that reduces memory usage and latency without sacrificing accuracy. The approach demonstrates strong cross-model generalization, robustness across datasets, and practical efficiency gains suitable for production-ready retrieval systems.
Abstract
The strong zero-shot and long-context capabilities of recent Large Language Models (LLMs) have paved the way for highly effective re-ranking systems. Attention-based re-rankers leverage attention weights from transformer heads to produce relevance scores, but not all heads are created equal: many contribute noise and redundancy, thus limiting performance. To address this, we introduce CoRe heads, a small set of retrieval heads identified via a contrastive scoring metric that explicitly rewards heads whose high attention correlates with relevant documents, while down-weighting heads whose high attention correlates with irrelevant documents. This relative ranking criterion isolates the most discriminative heads for re-ranking and yields a state-of-the-art list-wise re-ranker. Extensive experiments with three LLMs show that aggregated signals from CoRe heads, constituting less than 1% of all heads, substantially improve re-ranking accuracy over strong baselines. We further find that CoRe heads are concentrated in middle layers, and pruning the computation of the final 50% of model layers preserves accuracy while significantly reducing inference time and memory usage.
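The head-selection idea in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact form of the contrastive metric, so the score below (mean attention on relevant documents minus mean attention on irrelevant ones) is an assumed stand-in, not the paper's formula. The function names, the `(num_heads, num_docs)` attention layout, and the toy numbers are all hypothetical.

```python
import numpy as np

def contrastive_head_scores(attn, relevant):
    """Score each head by how much more it attends to relevant documents.

    attn: array of shape (num_heads, num_docs); attention mass that each
          head places on each candidate document for one query.
    relevant: boolean array of shape (num_docs,) marking relevant documents.
    NOTE: this difference-of-means score is an assumption for illustration.
    """
    pos = attn[:, relevant].mean(axis=1)    # attention on relevant docs
    neg = attn[:, ~relevant].mean(axis=1)   # attention on irrelevant docs
    return pos - neg

def select_top_heads(scores, k):
    """Return the indices of the k highest-scoring heads."""
    return np.argsort(-scores, kind="stable")[:k]

def rerank(attn, head_ids):
    """Rank documents by attention summed over the selected heads only."""
    doc_scores = attn[head_ids].sum(axis=0)
    return np.argsort(-doc_scores, kind="stable")

# Toy data: 3 heads, 4 documents, document 0 is the relevant one.
attn = np.array([
    [0.70, 0.10, 0.10, 0.10],   # head 0 focuses on the relevant document
    [0.10, 0.60, 0.20, 0.10],   # head 1 focuses on an irrelevant document
    [0.25, 0.25, 0.25, 0.25],   # head 2 is uninformative
])
relevant = np.array([True, False, False, False])

scores = contrastive_head_scores(attn, relevant)
top_heads = select_top_heads(scores, k=1)
ranking = rerank(attn, top_heads)
```

In this toy case the selected head is the one that concentrates on the relevant document, and re-ranking with that head alone places document 0 first. In practice the scores would presumably be averaged over many labeled queries before the heads are selected.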
