MRAG: A Modular Retrieval Framework for Time-Sensitive Question Answering
Zhang Siyue, Xue Yuxiang, Zhang Yiming, Wu Xiaobao, Luu Anh Tuan, Zhao Chen
TL;DR
MRAG addresses time-sensitive QA by disentangling semantic relevance from temporal reasoning in a training-free, modular retrieval framework. The work also introduces TempRAGEval, a diagnostic benchmark with temporal perturbations and gold evidence labels for stress-testing retrieval and QA components. MRAG combines three modules (Question Processing, Retrieval and Summarization, and Semantic-Temporal Hybrid Ranking) to achieve substantial retrieval gains, which in turn improve end-to-end QA performance across multiple LLM backbones; reported gains include 9.3% in top-1 recall and 11% in top-1 evidence recall, along with downstream EM/F1 improvements. The results indicate that integrating symbolic temporal scoring with semantic retrieval improves robustness to temporal perturbations, and TempRAGEval provides a benchmark for future research on reasoning-intensive retrieval.
Abstract
Understanding temporal relations and answering time-sensitive questions is a crucial yet challenging task for question-answering systems powered by large language models (LLMs). Existing approaches either update the parametric knowledge of LLMs with new facts, which is resource-intensive and often impractical, or integrate LLMs with external knowledge retrieval (i.e., retrieval-augmented generation). However, off-the-shelf retrievers often struggle to identify relevant documents when doing so requires intensive temporal reasoning. To systematically study time-sensitive question answering, we introduce the TempRAGEval benchmark, which repurposes existing datasets by incorporating temporal perturbations and gold evidence labels. As anticipated, all existing retrieval methods struggle with these temporal reasoning-intensive questions. We further propose Modular Retrieval (MRAG), a training-free framework that comprises three modules: (1) Question Processing, which decomposes each question into a main content and a temporal constraint; (2) Retrieval and Summarization, which retrieves evidence and uses LLMs to summarize it according to the main content; and (3) Semantic-Temporal Hybrid Ranking, which scores each evidence summary based on both semantic and temporal relevance. On TempRAGEval, MRAG significantly outperforms baseline retrievers in retrieval performance, which in turn improves final answer accuracy.
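To make the three-module flow concrete, the following is a minimal Python sketch and not the authors' implementation. It assumes that a regular expression is enough to split off simple temporal constraints ("in", "before", "after", "as of" followed by a year), that token overlap can stand in for a real semantic retriever score, and that evidence summaries (the output of module 2, which would normally come from a retriever and an LLM) are supplied as plain strings. All names (`process_question`, `semantic_score`, `temporal_score`, `hybrid_rank`) and the weight `alpha` are hypothetical.

```python
import re
from dataclasses import dataclass
from typing import List, Optional

# Assumption: only "in/before/after/as of <year>" constraints are recognized.
_TEMPORAL = re.compile(r"\b(in|before|after|as of)\s+(\d{4})\b", re.IGNORECASE)
_YEAR = re.compile(r"\b(1\d{3}|20\d{2})\b")


@dataclass
class ProcessedQuestion:
    main_content: str          # question with the temporal constraint removed
    relation: Optional[str]    # "in", "before", "after", "as of", or None
    year: Optional[int]


def process_question(question: str) -> ProcessedQuestion:
    """Module 1 (sketch): split a question into main content and a temporal constraint."""
    m = _TEMPORAL.search(question)
    if m is None:
        return ProcessedQuestion(question.strip(), None, None)
    main = question[:m.start()] + question[m.end():]
    main = re.sub(r"\s+", " ", main).strip()
    main = re.sub(r"\s+\?", "?", main)  # tidy the space left before "?"
    return ProcessedQuestion(main, m.group(1).lower(), int(m.group(2)))


def _tokens(text: str) -> set:
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def semantic_score(main_content: str, summary: str) -> float:
    """Stand-in for a semantic retriever score: fraction of question tokens covered."""
    q = _tokens(main_content)
    return len(q & _tokens(summary)) / len(q) if q else 0.0


def temporal_score(pq: ProcessedQuestion, summary: str) -> float:
    """Symbolic check of the years mentioned in a summary against the constraint."""
    if pq.year is None:
        return 0.0
    years = [int(y) for y in _YEAR.findall(summary)]
    if not years:
        return 0.0
    if pq.relation == "before":
        ok = any(y < pq.year for y in years)
    elif pq.relation == "after":
        ok = any(y > pq.year for y in years)
    else:  # "in" / "as of": exact mention, or a year span that covers the target
        ok = pq.year in years or (len(years) >= 2 and min(years) <= pq.year <= max(years))
    return 1.0 if ok else 0.0


def hybrid_rank(question: str, summaries: List[str], alpha: float = 0.5) -> List[str]:
    """Module 3 (sketch): rank summaries by a weighted sum of both scores."""
    pq = process_question(question)
    scored = [
        (alpha * semantic_score(pq.main_content, s) + (1 - alpha) * temporal_score(pq, s), s)
        for s in summaries
    ]
    return [s for _, s in sorted(scored, key=lambda pair: pair[0], reverse=True)]
```

Calling `hybrid_rank("Who was the CEO of Apple in 2009?", summaries)` on a handful of summaries would place first the one whose year span covers 2009, even when other summaries share similar wording. The linear combination with `alpha` is one simple design choice for merging the two signals; the scoring used in the paper may differ.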
