
Query-focused and Memory-aware Reranker for Long Context Processing

Yuqing Li, Jiangnan Li, Mo Yu, Guoxuan Ding, Zheng Lin, Weiping Wang, Jie Zhou

TL;DR

QRRanker reframes reranking as a training problem for Query-focused Retrieval (QR) heads inside LLMs, enabling a lightweight listwise ranker that outputs real-valued relevance scores without generation at inference. By selecting a small set of QR heads and training them with a contrastive objective on top-50 candidate passages, the method achieves state-of-the-art recall across Wikipedia multi-hop, long-context stories, and LoCoMo dialogue memory benchmarks while maintaining efficiency. It further demonstrates memory-aware extensions through prefix summaries and shows that training heads from middle layers yields near-equivalent performance with substantial latency reductions. The approach offers practical, scalable improvements for long-context processing and highlights easy extensions for incorporating global context without heavy memory management.

Abstract

Built upon the existing analysis of retrieval heads in large language models, we propose an alternative reranking framework that trains models to estimate passage-query relevance using the attention scores of selected heads. This approach provides a listwise solution that leverages holistic information within the entire candidate shortlist during ranking. At the same time, it naturally produces continuous relevance scores, enabling training on arbitrary retrieval datasets without requiring Likert-scale supervision. Our framework is lightweight and effective, requiring only small-scale models (e.g., 4B parameters) to achieve strong performance. Extensive experiments demonstrate that our method outperforms existing state-of-the-art pointwise and listwise rerankers across multiple domains, including Wikipedia and long narrative datasets. It further establishes a new state-of-the-art on the LoCoMo benchmark that assesses the capabilities of dialogue understanding and memory usage. We further demonstrate that our framework supports flexible extensions. For example, augmenting candidate passages with contextual information further improves ranking accuracy, while training attention heads from middle layers enhances efficiency without sacrificing performance.
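The core idea in the abstract, scoring each candidate passage from the attention of selected heads, can be illustrated with a toy example. This is a minimal sketch under stated assumptions, not the authors' implementation: the aggregation rule (attention mass from query tokens onto a passage's tokens, averaged over query tokens and heads), the head identifiers, and the names `passage_scores` and `rerank` are all hypothetical, and the attention values are made up for illustration.

```python
from typing import Dict, List, Tuple

# Assumed layout: attn[(layer, head)][q][t] is the attention weight from
# query token q to context token t. All values below are invented.
Head = Tuple[int, int]
Attention = Dict[Head, List[List[float]]]


def passage_scores(attn: Attention, heads: List[Head],
                   spans: List[Tuple[int, int]]) -> List[float]:
    """Score each passage by the attention mass its tokens receive from
    the query tokens, averaged over query tokens and selected heads."""
    scores = []
    for start, end in spans:
        total = 0.0
        for head in heads:
            rows = attn[head]  # one row of weights per query token
            mass = sum(sum(row[start:end]) for row in rows)
            total += mass / len(rows)
        scores.append(total / len(heads))
    return scores


def rerank(scores: List[float]) -> List[int]:
    """Return passage indices sorted from most to least relevant."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


# Toy input: three passages of two tokens each, a two-token query,
# and two hypothetical selected heads.
attn = {
    (10, 3): [[0.1, 0.1, 0.3, 0.3, 0.1, 0.1],
              [0.05, 0.05, 0.4, 0.4, 0.05, 0.05]],
    (12, 7): [[0.2, 0.2, 0.2, 0.2, 0.1, 0.1],
              [0.1, 0.1, 0.3, 0.3, 0.1, 0.1]],
}
spans = [(0, 2), (2, 4), (4, 6)]

scores = passage_scores(attn, [(10, 3), (12, 7)], spans)
ranking = rerank(scores)
print(ranking)  # [1, 0, 2]
```

Because all candidates sit in one context and compete for the same attention mass, the scores are listwise by construction, and they are continuous values rather than discrete labels, which matches the properties the abstract describes.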

Paper Structure (38 sections, 5 equations, 2 figures, 7 tables, 1 algorithm)

Figures (2)

  • Figure 1: The retrieval score and QR score are computed based on the attention score of a (QR) attention head. In this figure, Doc2 is the gold document (chunk).
  • Figure 2: The structure of QRRanker is shown in the middle, where the highlighted heads are the QR heads used for document scoring. Because QRRanker can use memory enhancement to capture more contextual information, memories can be constructed for narratives and dialogues, as shown on the left. The right part shows the rank-rerank QA pipeline for narratives, Wikipedia, and dialogues, which involves no sophisticated design.
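
The setup in Figure 1, where Doc2 is the gold chunk among the candidates, suggests how a contrastive training signal over attention-derived scores might look. The sketch below is an assumption, not the paper's stated loss: it uses a softmax over the candidate scores and penalizes low probability mass on the gold passages. The function name `listwise_contrastive_loss` and the `temperature` parameter are hypothetical, and the scores are invented.

```python
import math
from typing import List, Set


def listwise_contrastive_loss(scores: List[float], gold: Set[int],
                              temperature: float = 1.0) -> float:
    """Negative log of the softmax probability mass assigned to gold
    passages (an assumed loss form, chosen for illustration)."""
    scaled = [s / temperature for s in scores]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    positive = sum(exps[i] for i in gold)
    return -math.log(positive / sum(exps))


# Three candidates; index 1 plays the role of Doc2, the gold chunk.
candidate_scores = [0.1, 0.9, 0.2]
loss_gold_ranked_high = listwise_contrastive_loss(candidate_scores, {1})
loss_gold_ranked_low = listwise_contrastive_loss(candidate_scores, {0})

# With identical scores, one gold passage out of three gets probability 1/3.
loss_uniform = listwise_contrastive_loss([0.5, 0.5, 0.5], {1})
```

Under this assumed loss, the value shrinks as the gold passage's score rises relative to the other candidates, so minimizing it pushes the selected heads to concentrate attention on the gold chunk.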