MRAG: A Modular Retrieval Framework for Time-Sensitive Question Answering

Zhang Siyue, Xue Yuxiang, Zhang Yiming, Wu Xiaobao, Luu Anh Tuan, Zhao Chen

TL;DR

MRAG tackles the challenge of time-sensitive QA by disentangling semantic relevance from temporal reasoning in a trainless, modular retrieval framework. It introduces TempRAGEval, a diagnostic benchmark with temporal perturbations and gold evidence to stress-test retrieval and QA components. MRAG combines Question Processing, Retrieval and Summarization, and Semantic-Temporal Hybrid Ranking to achieve substantial retrieval gains, which in turn improve end QA performance across multiple LLM backbones; notable gains include 9.3% top-1 recall and 11% top-1 evidence recall, with downstream EM/F1 improvements. The work demonstrates that integrating symbolic temporal scoring with semantic retrieval enhances robustness to temporal perturbations and sets a benchmark for future reasoning-intensive retrieval research.

Abstract

Understanding temporal relations and answering time-sensitive questions is a crucial yet challenging task for question-answering systems powered by large language models (LLMs). Existing approaches either update the parametric knowledge of LLMs with new facts, which is resource-intensive and often impractical, or integrate LLMs with external knowledge retrieval (i.e., retrieval-augmented generation). However, off-the-shelf retrievers often struggle to identify relevant documents when doing so requires intensive temporal reasoning. To systematically study time-sensitive question answering, we introduce the TempRAGEval benchmark, which repurposes existing datasets by incorporating temporal perturbations and gold evidence labels. As anticipated, all existing retrieval methods struggle with these temporal reasoning-intensive questions. We further propose Modular Retrieval (MRAG), a trainless framework that includes three modules: (1) Question Processing, which decomposes a question into a main content and a temporal constraint; (2) Retrieval and Summarization, which retrieves evidence and uses LLMs to summarize it according to the main content; and (3) Semantic-Temporal Hybrid Ranking, which scores each evidence summary based on both semantic and temporal relevance. On TempRAGEval, MRAG significantly outperforms baseline retrievers in retrieval performance, leading to further improvements in final answer accuracy.
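To make the Question Processing step concrete, here is a minimal sketch of splitting a question into a main content and a temporal constraint. The abstract does not say how the decomposition is performed, so the regular expression below is a swapped-in heuristic rather than the authors' method, and every name (`ProcessedQuestion`, `process_question`, `_TEMPORAL`) is hypothetical.

```python
import re
from typing import NamedTuple, Optional


class ProcessedQuestion(NamedTuple):
    main_content: str          # the question with the temporal phrase removed
    relation: Optional[str]    # e.g. "as of", "before", "after", "in"
    year: Optional[int]        # the year named in the temporal phrase


# Assumption: a temporal constraint looks like "<relation> [day] [month] <year>".
# Real questions are more varied; this pattern is only illustrative.
_TEMPORAL = re.compile(
    r"\s*\b(as of|before|after|since|until|in)\s+"
    r"(?:\d{1,2}\s+)?(?:[A-Za-z]+\s+)?(\d{4})\b",
    re.IGNORECASE,
)


def process_question(question: str) -> ProcessedQuestion:
    """Separate the main content from a simple temporal constraint."""
    match = _TEMPORAL.search(question)
    if match is None:
        # No recognizable constraint: treat the whole question as main content.
        return ProcessedQuestion(question.strip(), None, None)
    # Cut the temporal phrase out and keep whatever surrounds it.
    main = (question[:match.start()] + question[match.end():]).strip()
    return ProcessedQuestion(main, match.group(1).lower(), int(match.group(2)))


example = process_question("Who was the CEO of Twitter as of 6 May 2021?")
print(example)
```

Retrieving on the main content alone keeps the dense retriever focused on topical relevance, leaving the temporal constraint to be checked separately at ranking time.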

Paper Structure

This paper contains 61 sections, 3 equations, 11 figures, 19 tables.

Figures (11)

  • Figure 1: A time-sensitive question example that requires temporal reasoning (as of 6 May 2021 → 2019-2022) to both retrieve documents and generate answers. State-of-the-art retrieval systems struggle to conduct in-depth reasoning to identify relevant documents. We provide a new diagnostic benchmark, TempRAGEval, and propose a new modular framework to tackle this challenge.
  • Figure 2: The retrieval performance degradation of the Gemma baseline on TempRAGEval-SituatedQA, comparing original and perturbed questions (the corresponding results for TempRAGEval-TimeQA are reported in a separate section of the paper).
  • Figure 3: An overview of the MRAG framework, consisting of three key modules: question processing, retrieval and summarization, and semantic-temporal hybrid ranking. The question processing module separates each query into the main content (MC) and the temporal constraint (TC). The retrieval and summarization module finds the most relevant evidence based on the main content and summarizes or splits this evidence into fine-grained sentences. The hybrid ranking module combines symbolic temporal scoring and dense embedding-based semantic scoring at a fine-grained level to determine the final evidence ranking.
  • Figure 4: A case study of the top-1 passages retrieved by Gemma and MRAG from TempRAGEval.
  • Figure 5: Similarity scores of query-document pairs by varying the temporal relation in the query and the date in the document.
  • ...and 6 more figures
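
The hybrid ranking idea in the Figure 3 caption, a symbolic temporal score combined with a dense semantic score, can be sketched as follows. This is an illustration under stated assumptions, not the paper's scoring function: the binary temporal score, the product used to combine the two scores, and all names (`Evidence`, `temporal_score`, `hybrid_rank`) are hypothetical, and the embeddings are toy vectors standing in for the output of a dense encoder.

```python
import math
from typing import List, NamedTuple, Sequence, Tuple


class Evidence(NamedTuple):
    text: str
    embedding: Sequence[float]   # assumed to come from some dense encoder
    span: Tuple[int, int]        # (start_year, end_year) the sentence covers


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def temporal_score(span: Tuple[int, int], query_year: int) -> float:
    """Symbolic check (assumption): 1.0 if the span contains the query year."""
    start, end = span
    return 1.0 if start <= query_year <= end else 0.0


def hybrid_rank(
    query_embedding: Sequence[float], query_year: int, evidence: List[Evidence]
) -> List[Tuple[float, Evidence]]:
    """Rank evidence by semantic score times temporal score (placeholder rule)."""
    scored = [
        (cosine(query_embedding, e.embedding) * temporal_score(e.span, query_year), e)
        for e in evidence
    ]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


candidates = [
    Evidence("A: served as CEO from 2015 to 2018", [1.0, 0.0], (2015, 2018)),
    Evidence("B: served as CEO from 2019 to 2022", [0.8, 0.6], (2019, 2022)),
]
ranked = hybrid_rank([1.0, 0.0], 2021, candidates)
print([e.text for _, e in ranked])
```

In this toy case the semantically closest sentence (A) fails the temporal check, so the sentence whose span covers 2021 (B) is ranked first, mirroring the "as of 6 May 2021 → 2019-2022" reasoning in the Figure 1 caption.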