DocReRank: Single-Page Hard Negative Query Generation for Training Multi-Modal RAG Rerankers

Navve Wasserman, Oliver Heinimann, Yuval Golbari, Tal Zimbalist, Eli Schwartz, Michal Irani

TL;DR

This work addresses the inefficiencies of traditional hard negative mining for multimodal RAG rerankers by proposing Single-Page Hard Negative Query Generation. Instead of mining hard negative documents, it generates hard negative queries per page via an LLM-VLM pipeline, enabling precise control over the negatives and efficient verification. The DocReRank model, trained on a mix of document-based negatives, generated hard negatives, and rephrased positives, achieves superior reranking performance across ViDoReV2 and Real-MM-RAG benchmarks, with further gains when incorporating finance-focused and rephrased data. The approach demonstrates that query-level generation can yield richer, more targeted training data, improving robustness to fine-grained factual distinctions and domain-specific challenges in multimodal document understanding.

Abstract

Rerankers play a critical role in multimodal Retrieval-Augmented Generation (RAG) by refining the ranking of an initial set of retrieved documents. Rerankers are typically trained using hard negative mining, which selects, for each query, pages that rank high but are actually irrelevant. However, this selection process is typically passive and restricted to what the retriever can find in the available corpus, which leads to several inherent limitations: limited diversity, negative examples that are often not hard enough, low controllability, and frequent false negatives that harm training. Our paper proposes an alternative approach, Single-Page Hard Negative Query Generation, which goes the other way around: instead of retrieving negative pages per query, we generate hard negative queries per page. Given a page and its positive query, an automated LLM-VLM pipeline creates hard negatives by rephrasing the query to be as similar as possible in form and context, yet not answerable from the page. This paradigm enables fine-grained control over the generated queries, resulting in diverse, hard, and targeted negatives. It also supports efficient false negative verification. Our experiments show that rerankers trained with data generated using our approach outperform existing models and significantly improve retrieval performance.
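The per-page generation loop described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `build_examples`, `TrainingExample`, and the two callables are hypothetical names, and the toy stand-ins use simple string rules in place of the LLM that rephrases the query and the VLM that checks answerability against the page.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class TrainingExample:
    page_id: str
    query: str
    label: int  # 1 = answerable from the page, 0 = hard negative


def build_examples(
    page_id: str,
    positive_query: str,
    generate_negatives: Callable[[str], List[str]],  # hypothetical LLM step
    is_answerable: Callable[[str, str], bool],       # hypothetical VLM check
) -> List[TrainingExample]:
    """Keep the positive query and every candidate that passes verification."""
    examples = [TrainingExample(page_id, positive_query, 1)]
    seen = {positive_query}
    for candidate in generate_negatives(positive_query):
        if candidate in seen:
            continue  # skip duplicates and unchanged rephrasings
        seen.add(candidate)
        if is_answerable(page_id, candidate):
            continue  # false negative: the page does answer it, so drop it
        examples.append(TrainingExample(page_id, candidate, 0))
    return examples


# Toy stand-ins (assumptions for illustration only).
def toy_generate(query: str) -> List[str]:
    return [
        query.replace("2022", "2021"),
        query.replace("revenue", "net income"),
        query,
    ]


def toy_is_answerable(page_id: str, query: str) -> bool:
    # Pretend the page also reports net income, so that variant is answerable.
    return "net income" in query


if __name__ == "__main__":
    for ex in build_examples(
        "page-1", "What was the revenue in 2022?", toy_generate, toy_is_answerable
    ):
        print(ex)
```

Because verification involves only one page and one candidate query at a time, a false negative can be caught with a single check per candidate, rather than by searching a corpus for pages that might also answer the query.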

Paper Structure

This paper contains 30 sections, 10 figures, 11 tables.

Figures (10)

  • Figure 1: Proposed Single-Page Hard Negative Query Generation Approach. While previous approaches retrieve hard negative pages per query from a document corpus, our method goes the other way around: we generate hard negative queries per page using an automated LLM-VLM pipeline. Our reranker, "DocReRank", which is trained on this kind of data, outperforms models trained with document-based hard negatives.
  • Figure 2: Re-ranking Framework. Given a query and a document corpus, a retrieval model first retrieves the top-$K$ relevant pages. A reranker then reorders these $K$ pages based on the query to improve retrieval quality.
  • Figure 3: Dataset Construction Pipeline.
  • Figure 4: Examples of Our Generated Negative Queries. We show examples of a cropped page and its positive query, along with the generated negative queries. Top: hard negatives generated using the general pipeline. Bottom: negatives generated using finance fine-detail prompts, which modify specific properties in the query.
  • Figure S1: Positive Query Generation Prompt. Creates RAG-style queries, answerable from the corresponding document, using the Pixtral-12B VLM (Agrawal et al., 2024). $N$ positive candidates are generated with the given prompt. The prompt emphasizes multimodal understanding by focusing on page elements such as figures, tables, and diagrams.
  • ...and 5 more figures
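
The two-stage framework in the Figure 2 caption (retrieve the top-$K$ pages, then reorder them with a reranker) can be sketched as below. This is a hedged illustration only: `retrieve_then_rerank` and both scoring callables are hypothetical, and the lookup tables stand in for real retriever and reranker models.

```python
from typing import Callable, Dict, List, Sequence


def retrieve_then_rerank(
    query: str,
    page_ids: Sequence[str],
    retriever_score: Callable[[str, str], float],  # hypothetical first-stage scorer
    reranker_score: Callable[[str, str], float],   # hypothetical reranker scorer
    k: int,
) -> List[str]:
    """Return the retriever's top-k pages, reordered by reranker score."""
    # Stage 1: keep the k pages the retriever scores highest.
    top_k = sorted(
        page_ids, key=lambda p: retriever_score(query, p), reverse=True
    )[:k]
    # Stage 2: reorder only those k pages with the reranker.
    return sorted(top_k, key=lambda p: reranker_score(query, p), reverse=True)


# Toy scores (assumptions for illustration only).
RETRIEVER: Dict[str, float] = {"a": 0.9, "b": 0.8, "c": 0.7, "d": 0.1}
RERANKER: Dict[str, float] = {"a": 0.2, "b": 0.9, "c": 0.5, "d": 1.0}


def toy_retriever(query: str, page_id: str) -> float:
    return RETRIEVER[page_id]


def toy_reranker(query: str, page_id: str) -> float:
    return RERANKER[page_id]


if __name__ == "__main__":
    print(retrieve_then_rerank("q", list(RETRIEVER), toy_retriever, toy_reranker, k=3))
```

In the toy data, page `d` has the highest reranker score but never reaches the reranker because the retriever ranks it last. The reranker can only reorder what the first stage returns.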