Rank-K: Test-Time Reasoning for Listwise Reranking

Eugene Yang, Andrew Yates, Kathryn Ricci, Orion Weller, Vivek Chari, Benjamin Van Durme, Dawn Lawrie

TL;DR

Rank-K addresses the challenge of achieving high-accuracy passage reranking without prohibitive compute by introducing test-time reasoning into a listwise reranker. It hinges on distilling reasoning traces from a large teacher into a tractable 32B model (QwQ-32B) using LoRA, enabling efficient yet effective reasoning-based ranking at inference. Empirical results across DL19/20, NeuCLIR, and BRIGHT show Rank-K surpasses the prior state-of-the-art RankZephyr, with strong cross-language transfer and robust performance when reranking strong initial results. The work demonstrates the practicality of test-time reasoning for multilingual IR and provides resources for distillation and benchmarking, signaling a viable path for scalable, reasoning-informed retrieval systems.

Abstract

Retrieve-and-rerank is a popular retrieval pipeline because it makes slow but effective rerankers efficient enough at query time by reducing the number of comparisons. Recent work on neural rerankers takes advantage of the ability of large language models to reason over queries and passages, and has achieved state-of-the-art retrieval effectiveness. However, such rerankers are resource-intensive, even after heavy optimization. In this work, we introduce Rank-K, a listwise passage reranking model that leverages the reasoning capability of a reasoning language model at query time, which provides test-time scalability for serving hard queries. We show that Rank-K improves retrieval effectiveness by 23% over RankZephyr, the state-of-the-art listwise reranker, when reranking a BM25 initial ranked list, and by 19% when reranking strong retrieval results from SPLADE-v3. Because Rank-K is inherently a multilingual model, we also find that it ranks passages for queries in different languages as effectively as it does in monolingual retrieval.
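
To make the retrieve-and-rerank idea concrete, here is a minimal Python sketch, not the authors' implementation. It assumes a toy term-overlap first stage in place of BM25 or SPLADE-v3, and a toy listwise reranker in place of a model such as Rank-K. The names `first_stage`, `retrieve_and_rerank`, and `toy_reranker` are hypothetical and chosen only for illustration. The point it shows is that the expensive reranker only ever sees the top-k candidates, not the whole corpus.

```python
from typing import Callable, List

# A listwise reranker sees the query and all candidate passages at once and
# returns candidate positions ordered from most to least relevant.
ListwiseReranker = Callable[[str, List[str]], List[int]]


def first_stage(query: str, corpus: List[str], k: int) -> List[int]:
    """Cheap retrieval (toy stand-in): rank by query-term overlap, keep top k."""
    q_terms = set(query.lower().split())
    overlap = [len(q_terms & set(p.lower().split())) for p in corpus]
    # sorted() is stable, so tied passages keep their corpus order.
    order = sorted(range(len(corpus)), key=lambda i: -overlap[i])
    return order[:k]


def retrieve_and_rerank(
    query: str, corpus: List[str], k: int, reranker: ListwiseReranker
) -> List[int]:
    """Run the cheap stage over the corpus, then rerank only the k survivors."""
    candidates = first_stage(query, corpus, k)
    permutation = reranker(query, [corpus[i] for i in candidates])
    # Map positions within the candidate list back to corpus indices.
    return [candidates[j] for j in permutation]


def toy_reranker(query: str, passages: List[str]) -> List[int]:
    """Hypothetical stand-in for an expensive model: prefer passages whose
    tokens are mostly query terms."""
    q_terms = set(query.lower().split())

    def density(passage: str) -> float:
        tokens = passage.lower().split()
        return sum(t in q_terms for t in tokens) / max(len(tokens), 1)

    return sorted(range(len(passages)), key=lambda j: -density(passages[j]))


corpus = [
    "neural rerankers use large language models",
    "cooking pasta at home",
    "rerankers reorder retrieved passages and long lists of other things too",
    "rerankers reorder passages",
]
ranking = retrieve_and_rerank(
    "rerankers reorder passages", corpus, k=3, reranker=toy_reranker
)
print(ranking)  # [3, 2, 0]
```

With `k=3`, the reranker is handed three passages instead of four. In a real pipeline the corpus holds millions of passages and k is small, which is where the savings come from.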

Paper Structure

This paper contains 15 sections, 4 figures, 6 tables.

Figures (4)

  • Figure 1: Overview of the training pipeline for Rank-K.
  • Figure 2: Prompt for Reasoning and Ranking Passages
  • Figure 3: Histogram of the number of rankings Rank-K generates when reranking each query on TREC DL 2019 and 2020. The count includes the intermediate partial rankings and the final full ranking. We see that Rank-K generates a non-uniform distribution of rankings.
  • Figure 4: A partial example thinking process produced by Rank-K. Passage summaries and self-reflection are omitted for presentation.