
Where Relevance Emerges: A Layer-Wise Study of Internal Attention for Zero-Shot Re-Ranking

Haodong Chen, Shengyao Zhuang, Zheng Yao, Guido Zuccon, Teerapong Leelanupab

TL;DR

This paper presents an orthogonal evaluation of generation, likelihood, and internal attention mechanisms across multiple ranking frameworks and identifies a universal distribution of relevance signals across transformer layers. This finding motivates the proposed Selective-ICR strategy, which reduces inference latency by 30%-50% without compromising effectiveness.

Abstract

Zero-shot document re-ranking with Large Language Models (LLMs) has evolved from Pointwise methods to Listwise and Setwise approaches that optimize computational efficiency. Despite their success, these methods predominantly rely on generative scoring or output logits, which face bottlenecks in inference latency and result consistency. In-Context Re-ranking (ICR) has recently been proposed as an $O(1)$ alternative method. ICR extracts internal attention signals directly, avoiding the overhead of text generation. However, existing ICR methods simply aggregate signals across all layers; layer-wise contributions and their consistency across architectures have been left unexplored. Furthermore, no unified study has compared internal attention with traditional generative and likelihood-based mechanisms across diverse ranking frameworks under consistent conditions. In this paper, we conduct an orthogonal evaluation of generation, likelihood, and internal attention mechanisms across multiple ranking frameworks. We further identify a universal "bell-curve" distribution of relevance signals across transformer layers, which motivates the proposed Selective-ICR strategy that reduces inference latency by 30%-50% without compromising effectiveness. Finally, evaluation on the reasoning-intensive BRIGHT benchmark shows that precisely capturing high-quality in-context attention signals fundamentally reduces the need for model scaling and reinforcement learning: a zero-shot 8B model matches the performance of 14B reinforcement-learned re-rankers, while even a 0.6B model outperforms state-of-the-art generation-based approaches. These findings redefine the efficiency-effectiveness frontier for LLM-based re-ranking and highlight the latent potential of internal signals for complex reasoning ranking tasks. Our code and results are publicly available at https://github.com/ielab/Selective-ICR.

Paper Structure

This paper contains 22 sections, 5 figures, and 4 tables.

Figures (5)

  • Figure 1: Layer-wise performance analysis (nDCG@10) on TREC-DL 2019 and 2020. Stars ($\star$) indicate the peak performance achieved by a single layer. Curly brackets $\{\min, \max\}$ denote the performance range across all layers. Dashed horizontal lines and values in parentheses $(\cdot)$ represent the nDCG@10 obtained from the full-layer aggregation strategy (Chen et al., 2025). Square brackets $[l_1\!-\!l_2]$ indicate the layers selected for the aggregation interval.
  • Figure 2: Illustration of Selective-ICR with Center-Biased Interval Selection, exemplified by Llama 3.1 8B with 32-layer architecture. Unlike the original ICR, which aggregates attention weights across all layers, Selective-ICR aggregates attention weights from a peak-anchored, center-biased interval to form token-level query scores ($s_Q$). The final relevance score ($s$) is the sum of calibrated scores ($s_Q - s_{cal}$) across all document tokens.
  • Figure 3: Efficiency gains of the Selective-ICR strategy, measured as the percentage reduction in latency relative to All-ICR for the forward pass and the total scoring stage.
  • Figure 4: Detailed layer-wise nDCG@10 trends across diverse BEIR tasks. Each sub-figure represents a specific model's performance trajectory across multiple datasets. Despite the varying semantic domains (e.g., medical, financial, or general knowledge), each model exhibits a highly consistent bell-shaped relevance distribution, peaking within the identified middle-layer intervals.
  • Figure 5: Overall distribution of nDCG@10 scores across the 12 sub-domains of the reasoning-intensive BRIGHT benchmark for Qwen3 variants. The plots reveal a consistent "bell curve" trend across scientific and mathematical domains, with peak performance concentrated in intermediate layers, while Pony, a retrieval task over programming-language syntactic primitives, exhibits a distinct early-layer signal concentration.
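
The scoring rule described in the Figure 2 caption can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names, the array layout (one row per layer, one column per context token, with heads and query tokens already pooled), and the use of a plain sum to aggregate over the selected layers are all assumptions made for illustration. How the calibration scores $s_{cal}$ are obtained and how the layer interval is chosen are not specified here, so both are taken as inputs.

```python
import numpy as np


def selective_icr_scores(attn, attn_cal, doc_spans, layer_interval):
    """Score documents from a selected interval of layers (illustrative sketch).

    attn:           array of shape (num_layers, num_tokens); attention mass each
                    context token receives from the query, per layer (assumed
                    to be pooled over heads and query tokens beforehand).
    attn_cal:       array of the same shape from a calibration pass (assumed given).
    doc_spans:      list of (start, end) token ranges, end exclusive, one per document.
    layer_interval: (lo, hi) inclusive layer indices to aggregate over.
    """
    lo, hi = layer_interval
    # Token-level query scores s_Q: aggregate only the selected layers.
    # Summation is an assumption; the caption only says "aggregates".
    s_q = attn[lo:hi + 1].sum(axis=0)
    # Calibration scores s_cal, aggregated over the same interval.
    s_cal = attn_cal[lo:hi + 1].sum(axis=0)
    calibrated = s_q - s_cal
    # Final relevance score s: sum of calibrated scores over each document's tokens.
    return np.array([calibrated[start:end].sum() for start, end in doc_spans])


def rank_documents(scores):
    """Return document indices ordered from most to least relevant."""
    return [int(i) for i in np.argsort(-scores, kind="stable")]
```

Passing `layer_interval=(0, num_layers - 1)` makes the same function aggregate over every layer, which corresponds to the full-layer behaviour the caption contrasts Selective-ICR with.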