LoRE: Logit-Ranked Retriever Ensemble for Enhancing Open-Domain Question Answering
Saikrishna Sanniboina, Shiv Trivedi, Sreenidhi Vijayaraghavan
TL;DR
This work tackles positional bias and inefficiency in retrieval-augmented QA by introducing LoRE, a Logit-Ranked Retriever Ensemble. LoRE combines an ensemble of retrievers (a FAISS-indexed dense vector retriever and BM25) with a logit-based ranking system that fuses LLM confidence with retrieval ranks to select the most substantiated passages, while a T5 Large model generates an answer for each retrieved context. The LoR score blends the generator's Mean Score with a Context Rank term, $\text{LoR} = (w_1 \times \text{Mean Score}) + (w_2 \times \dfrac{1}{\text{Context Rank}})$ with $w_1=0.8$ and $w_2=0.2$, and aims to reduce hallucinations and improve answer relevance. On SQuAD, LoRE reaches ROUGE-L of up to $64.8\%$, EM of $61.45\%$, and F1 of $69.27\%$, improvements of $+14.5\%$, $+22.83\%$, and $+14.95\%$, respectively. On NarrativeQA, it improves both exact match and F1, and qualitative analyses point to fewer hallucinations and more accurate, contextually grounded answers. Overall, LoRE advances retrieval-based QA by combining retrieval diversity, logit-informed validation, and contextual prioritization to deliver more reliable and efficient open-domain question answering. LoR scoring and ensemble fusion offer a practical path toward scalable, bias-resistant QA in real-world systems.
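To make the LoR formula concrete, the following is a minimal Python sketch rather than the authors' implementation. Only the weights $w_1=0.8$ and $w_2=0.2$ and the formula itself come from the text. The `Candidate` class, the function names, the example values, and the assumptions that Mean Score is a single per-answer confidence value and that Context Rank is 1-based are all hypothetical.

```python
from dataclasses import dataclass

W_MEAN_SCORE = 0.8    # w1, as stated in the text
W_CONTEXT_RANK = 0.2  # w2, as stated in the text


@dataclass
class Candidate:
    """Hypothetical container for one per-context answer."""
    answer: str
    mean_score: float  # assumed: generator confidence for this answer
    context_rank: int  # assumed: 1-based retrieval rank of its context


def lor_score(mean_score: float, context_rank: int,
              w1: float = W_MEAN_SCORE, w2: float = W_CONTEXT_RANK) -> float:
    """LoR = w1 * Mean Score + w2 * (1 / Context Rank)."""
    if context_rank < 1:
        raise ValueError("context_rank must be a 1-based positive integer")
    return w1 * mean_score + w2 * (1.0 / context_rank)


def rank_candidates(candidates: list[Candidate]) -> list[Candidate]:
    """Sort candidate answers by LoR score, highest first."""
    return sorted(
        candidates,
        key=lambda c: lor_score(c.mean_score, c.context_rank),
        reverse=True,
    )


# Illustrative values only: a confident answer from the third-ranked context
# can outrank a less confident answer from the top-ranked context.
candidates = [
    Candidate(answer="Lyon", mean_score=0.7, context_rank=1),
    Candidate(answer="Paris", mean_score=0.9, context_rank=3),
]
best = rank_candidates(candidates)[0]
print(best.answer)
```

With these illustrative numbers, the answer from the third-ranked context scores about $0.787$ and the answer from the top-ranked context scores $0.76$, so the larger weight on Mean Score lets generator confidence override retrieval position. The example assumes Mean Score lies on a scale comparable to $1/\text{Context Rank}$; raw logits on a different scale would need normalizing first.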
Abstract
Retrieval-based question answering systems often suffer from positional bias, leading to suboptimal answer generation. We propose LoRE (Logit-Ranked Retriever Ensemble), a novel approach that improves answer accuracy and relevance by mitigating positional bias. LoRE employs an ensemble of diverse retrievers, such as BM25 and sentence transformers with FAISS indexing. A key innovation is a logit-based answer ranking algorithm that combines the logit scores from a large language model (LLM) with the retrieval ranks of the passages. Experimental results on NarrativeQA and SQuAD demonstrate that LoRE significantly outperforms existing retrieval-based methods in terms of exact match and F1 scores. On SQuAD, LoRE achieves 14.5\%, 22.83\%, and 14.95\% improvements over the baselines for ROUGE-L, EM, and F1, respectively. Qualitatively, LoRE generates more relevant and accurate answers, especially for complex queries.
