MST-R: Multi-Stage Tuning for Retrieval Systems and Metric Evaluation
Yash Malviya, Karan Dhingra, Maneesh Singh
TL;DR
The paper tackles the challenge of applying retrieval-augmented generation to regulatory QA by proposing MST-R, a multi-stage, domain-adaptive tuning approach for retrieval components. It introduces a two-level retrieval pipeline with a domain-adapted hybrid L1 retriever and a cross-encoder reranker, followed by three answer-generation strategies using a fixed LLM. On ObliQA, MST-R achieves state-of-the-art Recall@10 and MAP@10 and demonstrates notable gains over baselines, while also revealing weaknesses in the RePASs metric through analysis of trivial optimizers and extended reasoning contexts. The authors advocate for better evaluation frameworks, including LLM-based judgments, to reliably assess regulatory QA performance. The work highlights practical implications for deploying regulatory QA systems and suggests future directions for end-to-end domain adaptation of the answer generator and metric design.
Abstract
Regulatory documents are rich in nuanced terminology and specialized semantics. Frozen retrieval-augmented generators (FRAG systems), which rely on pre-trained (frozen) components, consequently struggle with both retrieval and answering performance. We present a system that adapts the retriever to the target domain using a multi-stage tuning (MST) strategy. Our retrieval approach, called MST-R, (a) first fine-tunes the encoders used in vector stores using hard negative mining, (b) then builds a hybrid retriever that combines sparse and dense retrievers using reciprocal rank fusion, and (c) finally adapts the cross-attention encoder by fine-tuning it on only the top-k retrieved results. We benchmark the system on the dataset released for the RIRAG challenge (part of the RegNLP workshop at COLING 2025). We achieve significant performance gains, obtaining a top rank on the RegNLP challenge leaderboard. We also show that a trivial answering approach games the RePASs metric, outscoring all baselines and a pre-trained Llama model. Analyzing this anomaly, we present important takeaways for future research.
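To make the hybrid step in (b) concrete, the following is a minimal sketch of reciprocal rank fusion over a sparse and a dense ranking. It is an illustration under stated assumptions, not the authors' implementation: the function name, the passage IDs, and the smoothing constant k = 60 (a commonly used default for this technique) are all assumptions and are not taken from the paper.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of passage IDs into one ranking.

    Each passage scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. k = 60 is an assumed default, not a value
    reported in the paper.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by descending fused score; break ties by ID for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical passage IDs returned by a sparse and a dense retriever.
sparse_ranking = ["p3", "p1", "p7"]
dense_ranking = ["p1", "p5", "p3"]

fused = reciprocal_rank_fusion([sparse_ranking, dense_ranking])
print(fused)
```

Because the fusion uses only rank positions, it does not require the sparse and dense retrievers to produce scores on a comparable scale; passages that appear near the top of both lists rise above those favoured by only one retriever.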
