Scalable In-context Ranking with Generative Models

Nilesh Gupta, Chong You, Srinadh Bhojanapalli, Sanjiv Kumar, Inderjit Dhillon, Felix Yu

TL;DR

This paper tackles the efficiency bottleneck of in-context ranking (ICR) with large language models. It identifies structured attention patterns in LLMs finetuned for ICR and designs BlockRank, a blockwise attention scheme that reduces per-layer attention cost from quadratic in the context length to about $O((N+2)L^2d)$, i.e., near linear in the number of documents $N$. BlockRank combines this structured sparse attention with a contrastive auxiliary loss (InfoNCE) that explicitly optimizes query-to-document attention and enables a fast attention-based inference path. Empirical results on BEIR, MSMarco, and Natural Questions show that BlockRank matches or surpasses strong baselines while delivering substantial inference speedups (e.g., ~4.7x at $N=100$, scaling to 500 docs within ~1s). The method shows strong zero-shot generalization and competitive in-domain performance, which makes long candidate lists practical for LLM-based ICR in realistic long-context retrieval settings at reduced computational cost.
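The InfoNCE auxiliary loss mentioned above can be illustrated with a minimal sketch. It assumes each candidate document already has one scalar score (for example, attention mass from a query token to that document's block), and treats the loss as a softmax cross-entropy over those scores. The function name `info_nce_loss`, the `temperature` parameter, and the example scores are hypothetical and are not taken from the paper.

```python
import math

def info_nce_loss(doc_scores, positive_index, temperature=1.0):
    """InfoNCE over per-document scores: -log softmax(scores / T)[positive].

    doc_scores: one scalar per candidate document (assumed to be derived
    from attention; the exact definition is an assumption of this sketch).
    """
    scaled = [s / temperature for s in doc_scores]
    m = max(scaled)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(s - m) for s in scaled))
    return log_z - scaled[positive_index]

# Hypothetical scores for 4 candidates; document 2 has the highest score.
scores = [0.1, 0.3, 2.0, 0.2]
loss_if_doc2_relevant = info_nce_loss(scores, positive_index=2)
loss_if_doc0_relevant = info_nce_loss(scores, positive_index=0)
print(loss_if_doc2_relevant, loss_if_doc0_relevant)
```

The loss is small when the relevant document already receives the highest score and large otherwise, so minimizing it pushes attention toward the relevant block.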

Abstract

In-context Ranking (ICR) is an emerging paradigm for Information Retrieval (IR), which leverages the contextual understanding of LLMs by directly incorporating the task description, candidate documents, and the query into the model's input prompt and tasking the LLM to identify relevant document(s). While it is effective, efficiency is a significant challenge in this paradigm, especially as the candidate list grows, due to the quadratic/super-linear scaling of the attention operation with context length. To this end, this paper first identifies inherent and exploitable structures in the attention of LLMs finetuned for ICR: (1) inter-document block sparsity: attention is dense within each document block but sparse across different documents in the context; and (2) query-document block relevance: the attention scores from certain query tokens to a document block in middle layers strongly correlate with that document's actual relevance. Motivated by these observations, we introduce BlockRank (Blockwise In-context Ranking), a novel method that adapts the attention operation in an LLM by (a) architecturally enforcing the observed inter-document block sparsity, reducing attention complexity from quadratic to linear without loss in performance, and (b) optimizing query-document block relevance for true relevant documents during fine-tuning using an auxiliary contrastive training objective, improving retrieval in attention. Experiments on BEIR, MSMarco and NQ with Mistral-7B demonstrate that BlockRank Mistral matches or outperforms existing SOTA listwise rankers and a controlled fine-tuned baseline while being significantly more efficient at inference (4.7x for 100 MSMarco documents in context) and scaling gracefully to long-context shortlists, around 500 documents in-context (approximately 100K context length) within a second, presenting a scalable and effective solution for ICR.
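The query-document block relevance observation suggests ranking documents by how much attention a query token places on each document block. The sketch below assumes a single row of attention weights (one query token, one middle layer) and sums the weights inside each document's token span. The choice of sum as the aggregation, the function name `rank_by_block_attention`, and the toy numbers are assumptions for illustration, not the paper's exact procedure.

```python
def rank_by_block_attention(attn_row, doc_spans):
    """Rank documents by the attention mass a query token assigns to each block.

    attn_row:  attention weights from one query token to every prompt position.
    doc_spans: (start, end) half-open token ranges, one per document.
    Returns (document indices sorted by descending score, per-document scores).
    """
    scores = [sum(attn_row[start:end]) for start, end in doc_spans]
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order, scores

# Toy prompt layout (hypothetical): 2 instruction tokens, two 3-token
# documents, then 2 query tokens.
attn_row = [0.10, 0.10,          # instruction
            0.05, 0.05, 0.05,    # document 0
            0.20, 0.20, 0.15,    # document 1
            0.05, 0.05]          # query
doc_spans = [(2, 5), (5, 8)]

order, scores = rank_by_block_attention(attn_row, doc_spans)
print(order, scores)
```

Because the scores come from a single attention row, a ranking of this kind needs no autoregressive decoding, which is consistent with the fast inference path the abstract describes.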

Paper Structure

This paper contains 47 sections, 4 equations, 7 figures, 7 tables.

Figures (7)

  • Figure 1: Analysis of attention patterns in Mistral-7B performing In-context Ranking (ICR) on MSMarco. (left) Attention averaged over middle layers 16-21 reveals structural sparsity: a strong diagonal (intra-document attention needed for local context processing) and significant attention to the first row (focus on the query-based instruction). (middle) Attention in Layer 18 from individual query tokens to document segments. Certain tokens (the last token, ':') attend primarily to the relevant document only (i.e., Doc24, highlighted in green). (right) Attention from final query tokens across layers shows retrieval signals strengthening in middle layers. These patterns motivate our BlockRank approach.
  • Figure 2: BlockRank starts with chunking the full prompt into segments and then processes it using structured attention, where the documents only attend to themselves and the instruction segment, while the query segment attends to the full prompt. It also incorporates an auxiliary attention loss ($\mathcal{L}_{\text{aux}}$) from a middle layer ($l^*$) that increases sharpness of attention on the relevant documents and enables an alternate inference mechanism using attention scores derived from $l^*$.
  • Figure 3: Example structure of the prompt template used in our experiments, showing query-based instruction, abbreviated document list, and the final query section.
  • Figure 4: P@1 and Latency (annotated) of BlockRank vs Full-FT Mistral, scaling $N$ on MSMarco.
  • Figure 5: Performance of the Full-FT model's attention-based inference as a function of the query token from which attention scores are extracted.
  • ...and 2 more figures
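
The structured attention described in the Figure 2 caption (documents attend only to themselves and the instruction segment, while the query segment attends to the full prompt) can be sketched as a boolean mask. This is a minimal illustration under stated assumptions: the function name `blockrank_mask`, the segment lengths, the `causal` flag, and the choice to let the instruction attend to itself are all hypothetical and not confirmed by the source.

```python
def blockrank_mask(instr_len, doc_lens, query_len, causal=True):
    """Build a mask where mask[i][j] is True if position i may attend to j.

    Segment ids: 0 = instruction, 1..N = documents, N+1 = query.
    causal=True additionally blocks attention to later positions
    (an assumption of this sketch for a decoder-only model).
    """
    seg = [0] * instr_len
    for doc_id, n in enumerate(doc_lens, start=1):
        seg += [doc_id] * n
    query_id = len(doc_lens) + 1
    seg += [query_id] * query_len

    total = len(seg)
    mask = [[False] * total for _ in range(total)]
    for i in range(total):
        for j in range(total):
            if causal and j > i:
                continue  # no attention to future positions
            if seg[i] == query_id:
                mask[i][j] = True  # query attends to the full prompt
            else:
                # instruction and documents: own segment plus the instruction
                mask[i][j] = seg[j] == seg[i] or seg[j] == 0
    return mask, seg

mask, seg = blockrank_mask(instr_len=2, doc_lens=[3, 3], query_len=2)
allowed_pairs = sum(sum(row) for row in mask)
full_causal_pairs = len(seg) * (len(seg) + 1) // 2
print(allowed_pairs, full_causal_pairs)
```

Counting the allowed pairs shows where the savings come from: cross-document entries are dropped, so the number of attended pairs grows with the number of documents rather than with its square.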