ReSCORE: Label-free Iterative Retriever Training for Multi-hop Question Answering with Relevance-Consistency Supervision

Dosung Lee, Wonjun Oh, Boyoung Kim, Minyoung Kim, Joonsuk Park, Paul Hongsuck Seo

TL;DR

This work tackles training dense retrievers for multi-hop QA without labeled query-document pairs. It introduces ReSCORE, which uses large language model probabilities to generate pseudo-ground-truth (pseudo-GT) labels that jointly capture a document's relevance to a question and its consistency with the correct answer. Integrated into an iterative RAG framework (IQATR), ReSCORE achieves state-of-the-art MHQA performance across MuSiQue, 2WikiMHQA, and HotpotQA, while also enabling deeper analysis of pseudo-GT labels and query reformulation strategies. The results highlight the potential of label-free retriever training for complex multi-hop reasoning, with practical considerations around generalization and computation.

Abstract

Multi-hop question answering (MHQA) involves reasoning across multiple documents to answer complex questions. Dense retrievers typically outperform sparse methods like BM25 by leveraging semantic embeddings; however, they require labeled query-document pairs for fine-tuning. This poses a significant challenge in MHQA because the queries, that is, the (reformulated) questions, vary widely across the reasoning steps. To overcome this limitation, we introduce Retriever Supervision with Consistency and Relevance (ReSCORE), a novel method for training dense retrievers for MHQA without labeled documents. ReSCORE leverages large language models to capture each document's relevance to the question and its consistency with the correct answer, and uses these signals to train a retriever within an iterative question-answering framework. Experiments on three MHQA benchmarks demonstrate the effectiveness of ReSCORE, with significant improvements in retrieval and, in turn, state-of-the-art MHQA performance. Our implementation is available at: https://leeds1219.github.io/ReSCORE.

Paper Structure

This paper contains 24 sections, 4 equations, 3 figures, 8 tables.

Figures (3)

  • Figure 1: Iterative RAG Framework for MHQA. At iteration $i$, the framework first retrieves the top-$k$ documents relevant to the current query $q^{(i)}$ and generates an answer $a^{(i)}$. (a) If the answer is "unknown", a thought $t^{(i)}$ is generated as a compact representation of the retrieved documents based on the query $q^{(i)}$. This thought is then used to reformulate the query into $q^{(i+1)}$, and the process continues with the next iteration. (b) If $a^{(i)}$ is not "unknown", the iteration ends, and $a^{(i)}$ is returned as the final answer.
  • Figure 2: Overview of ReSCORE. At each iteration $i$ within an iterative RAG process, the retriever receives gradients from the KL-divergence loss of the retrieval distribution $P_R^{(i)}$ against the pseudo-GT distribution $Q_\text{LM}^{(i)}$, which is derived, with normalization, from the LLM probabilities of the question and answer given each document $d_j^{(i)}$. The number of iterations is dynamically determined by the LLM, and the process ends if the LLM predicts an answer that is not "unknown". The red dashed lines represent gradient flows for the retriever.
  • Figure 3: Comparison of GT and Pseudo-GT Labels on All Relevant Document Retrieval. The y-axis shows the proportion of questions for which all relevant documents were retrieved; every one of these documents is needed to correctly answer a given complex question. Pseudo-GT labels lead to improved performance as the number of iterations increases.
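
The loop in Figure 1 and the loss in Figure 2 can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the callables passed to `iterative_answer`, and the `max_iters` safety cap are hypothetical; the pseudo-GT distribution is assumed to be a softmax over per-document LLM log-probabilities of the question and answer; and the loss is assumed to be $\mathrm{KL}(Q_\text{LM} \,\|\, P_R)$, since the captions do not fix the direction or the exact normalization.

```python
import math
from typing import Callable, List, Sequence


def log_softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable log-softmax over the scores of the top-k documents."""
    m = max(scores)
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return [s - log_z for s in scores]


def rescore_style_loss(lm_logprobs: Sequence[float],
                       retriever_scores: Sequence[float]) -> float:
    """KL(Q_LM || P_R) over one set of retrieved documents.

    lm_logprobs[j]      : assumed LLM log-probability of (question, answer)
                          given document j (hypothetical input).
    retriever_scores[j] : retriever similarity score for document j.
    The KL direction is an assumption of this sketch.
    """
    log_q = log_softmax(lm_logprobs)       # pseudo-GT distribution Q_LM
    log_p = log_softmax(retriever_scores)  # retrieval distribution P_R
    return sum(math.exp(lq) * (lq - lp) for lq, lp in zip(log_q, log_p))


def iterative_answer(question: str,
                     retrieve: Callable[[str], List[str]],
                     answer: Callable[[str, List[str]], str],
                     think: Callable[[str, List[str]], str],
                     reformulate: Callable[[str, str], str],
                     max_iters: int = 5) -> str:
    """Iterative RAG loop from Figure 1 with caller-supplied components.

    max_iters is a safety cap added for this sketch; in the paper the LLM
    decides when to stop by producing an answer other than "unknown".
    """
    query = question
    for _ in range(max_iters):
        docs = retrieve(query)           # top-k documents for q^(i)
        ans = answer(query, docs)        # a^(i)
        if ans != "unknown":
            return ans                   # case (b): stop and return
        thought = think(query, docs)     # t^(i), case (a)
        query = reformulate(query, thought)  # q^(i+1)
    return "unknown"
```

In a training setup, `rescore_style_loss` would be computed at every iteration of the loop with a differentiable tensor library so that gradients reach only the retriever scores; the pure-Python version here only shows the arithmetic.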