JudgeRank: Leveraging Large Language Models for Reasoning-Intensive Reranking

Tong Niu, Shafiq Joty, Ye Liu, Caiming Xiong, Yingbo Zhou, Semih Yavuz

TL;DR

JudgeRank is a novel agentic reranker that emulates human cognitive processes when assessing document relevance. It generalizes well across LLMs of various sizes, and ensembling them yields even more accurate reranking than individual models.

Abstract

Accurate document retrieval is crucial for the success of retrieval-augmented generation (RAG) applications, including open-domain question answering and code completion. While large language models (LLMs) have been employed as dense encoders or listwise rerankers in RAG systems, they often struggle with reasoning-intensive tasks because they lack nuanced analysis when judging document relevance. To address this limitation, we introduce JudgeRank, a novel agentic reranker that emulates human cognitive processes when assessing document relevance. Our approach consists of three key steps: (1) query analysis to identify the core problem, (2) document analysis to extract a query-aware summary, and (3) relevance judgment to provide a concise assessment of document relevance. We evaluate JudgeRank on the reasoning-intensive BRIGHT benchmark, demonstrating substantial performance improvements over first-stage retrieval methods and outperforming other popular reranking approaches. In addition, JudgeRank performs on par with fine-tuned state-of-the-art rerankers on the popular BEIR benchmark, validating its zero-shot generalization capability. Through comprehensive ablation studies, we demonstrate that JudgeRank's performance generalizes well across LLMs of various sizes while ensembling them yields even more accurate reranking than individual models.
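
The three steps named in the abstract can be sketched as a small pipeline. This is a minimal illustration, not the authors' implementation: the prompt templates, the `llm` callable (any function mapping a prompt string to a completion string), and the names `judge_document` and `rerank` are all assumptions made for this example.

```python
from typing import Callable, List

# Hypothetical prompt templates; the paper's actual prompts are not reproduced here.
QUERY_PROMPT = "Identify the core problem asked in this query:\n{query}"
DOC_PROMPT = (
    "Query analysis:\n{analysis}\n"
    "Extract the sentences of this document that are relevant to it:\n{doc}"
)
JUDGE_PROMPT = (
    "Query analysis:\n{analysis}\n"
    "Document summary:\n{summary}\n"
    "Is the document relevant? Answer Yes or No."
)


def judge_document(llm: Callable[[str], str], query: str, doc: str) -> bool:
    """Run the three steps for one (query, document) pair."""
    # Step 1: query analysis to identify the core problem.
    analysis = llm(QUERY_PROMPT.format(query=query))
    # Step 2: document analysis to extract a query-aware summary.
    summary = llm(DOC_PROMPT.format(analysis=analysis, doc=doc))
    # Step 3: a concise relevance judgment.
    verdict = llm(JUDGE_PROMPT.format(analysis=analysis, summary=summary))
    return verdict.strip().lower().startswith("yes")


def rerank(llm: Callable[[str], str], query: str, docs: List[str]) -> List[str]:
    """Move documents judged relevant ahead of the rest.

    `docs` is assumed to arrive in first-stage retrieval order. Python's sort
    is stable, so that order is preserved within each group.
    """
    judged = [(doc, judge_document(llm, query, doc)) for doc in docs]
    return [doc for doc, relevant in sorted(judged, key=lambda pair: not pair[1])]
```

In this sketch the query analysis is recomputed for every document. A practical variant would compute it once per query and reuse it across all candidate documents.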

Paper Structure

This paper contains 32 sections, 6 equations, 5 figures, 2 tables.

Figures (5)

  • Figure 1: A step-by-step illustration of how JudgeRank arrives at the final judgment through query and document analyses. The query analysis identifies the core problem being asked, while the document analysis extracts relevant sentences from the document based on the query. This is a real example from the Biology task in the BRIGHT evaluation benchmark.
  • Figure 2: (a) Prompt to analyze query, where {query name} (e.g., "Biology post") and {query} are placeholders for the query type and content. (b) Prompt for analyzing a document, where {doc name} (e.g., "document") and {doc} are placeholders for the document type and content. (c) Prompt for making the final one-word relevance judgment.
  • Figure 3: On the left: judgment alignment studies for models of three sizes: $8$B, $70$B, and $405$B. Percentages are shown for each quadrant. On the right: nDCG@10 of each individual model and model ensembling on the BRIGHT evaluation benchmark.
  • Figure 4: Ablation studies of JudgeRank. On the left: comparison of three scoring settings on the BRIGHT evaluation benchmark. Binary stands for binary judgment, Prob stands for probability, and Hybrid stands for a weighted sum of BM25 and probability scores. On the right: comparison of direct judge and judge with query and document analyses on the BRIGHT evaluation benchmark.
  • Figure 5: Illustration of how the agentic generations of JudgeRank help identify relevant documents. On the left, the document is ranked high by the first-stage retriever but judged as negative by the reranker. On the right, the document is ranked low by the first-stage retriever but judged as positive by the reranker because the document analysis prompt helps the LLM locate the relevant sentences that answer the query.
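
The three scoring settings named in the Figure 4 caption (Binary, Prob, Hybrid) can be illustrated with a short sketch. The caption states only that Hybrid is a weighted sum of BM25 and probability scores. Everything else here is an assumption for illustration: deriving the probability from a softmax over "Yes"/"No" logits, the 0.5 threshold for the binary setting, min-max normalization of BM25 scores, and the default `weight` of 0.5.

```python
import math
from typing import List


def yes_probability(logit_yes: float, logit_no: float) -> float:
    """Assumed 'Prob' score: softmax over the two answer-token logits."""
    m = max(logit_yes, logit_no)  # subtract the max for numerical stability
    e_yes = math.exp(logit_yes - m)
    e_no = math.exp(logit_no - m)
    return e_yes / (e_yes + e_no)


def binary_score(prob: float) -> float:
    """Assumed 'Binary' score: 1.0 for a Yes judgment, else 0.0."""
    return 1.0 if prob >= 0.5 else 0.0


def min_max(scores: List[float]) -> List[float]:
    """Rescale scores to [0, 1]; an assumption so BM25 and probabilities are comparable."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0] * len(scores)
    return [(s - lo) / (hi - lo) for s in scores]


def hybrid_scores(bm25: List[float], probs: List[float], weight: float = 0.5) -> List[float]:
    """'Hybrid' score: weighted sum of (normalized) BM25 and probability scores."""
    normalized = min_max(bm25)
    return [weight * b + (1.0 - weight) * p for b, p in zip(normalized, probs)]
```

Documents would then be sorted by the chosen score in descending order. The weight and the normalization scheme are tunable choices in this sketch, not values taken from the paper.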