LoRE: Logit-Ranked Retriever Ensemble for Enhancing Open-Domain Question Answering

Saikrishna Sanniboina, Shiv Trivedi, Sreenidhi Vijayaraghavan

TL;DR

This work tackles positional bias and inefficiency in retrieval-augmented QA by introducing LoRE, a Logit-Ranked Retriever Ensemble. LoRE combines an ensemble of retrievers (FAISS vector search and BM25) with a logit-based ranking system that fuses LLM confidence with retrieval ranks to select the best-supported passages, generating one answer per retrieved context with a T5 Large model. The LoR scoring framework blends the generator's Mean Score with a Context Rank term, $\text{LoR} = (w_1 \times \text{Mean Score}) + (w_2 \times \dfrac{1}{\text{Context Rank}})$ with $w_1=0.8$ and $w_2=0.2$, and aims to reduce hallucinations and improve answer relevance. On SQuAD, scores reach up to $64.8\%$ ROUGE-L, $61.45\%$ EM, and $69.27\%$ F1, improvements of $+14.5\%$, $+22.83\%$, and $+14.95\%$ over the baselines, respectively. On NarrativeQA, LoRE improves both exact match and F1, and qualitative analyses highlight reduced hallucinations and more accurate, contextually grounded answers. Overall, LoRE advances retrieval-based QA by combining retrieval diversity, logit-informed validation, and contextual prioritization to deliver more reliable and efficient open-domain question answering. LoR scoring and ensemble fusion provide a practical pathway to scalable, bias-resistant QA in real-world systems.
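The LoR formula above can be illustrated with a minimal Python sketch. The weights $w_1=0.8$ and $w_2=0.2$ come from the text; everything else is an assumption made for illustration: the `Candidate` class, the function names, the treatment of Mean Score as a single number where higher is better, and the example values, which are made up and not taken from the paper's experiments.

```python
from dataclasses import dataclass

W1 = 0.8  # weight on the generator's Mean Score (value stated in the text)
W2 = 0.2  # weight on the reciprocal Context Rank (value stated in the text)


@dataclass
class Candidate:
    """Hypothetical container for one per-context answer."""
    answer: str
    mean_score: float   # assumed: generator confidence, higher is better
    context_rank: int   # 1-based retrieval rank of the supporting passage


def lor_score(mean_score: float, context_rank: int,
              w1: float = W1, w2: float = W2) -> float:
    """LoR = w1 * Mean Score + w2 * (1 / Context Rank)."""
    if context_rank < 1:
        raise ValueError("context_rank must be a 1-based rank")
    return w1 * mean_score + w2 * (1.0 / context_rank)


def best_candidate(candidates):
    """Return the candidate with the highest LoR score."""
    return max(candidates, key=lambda c: lor_score(c.mean_score, c.context_rank))


# Made-up values: the top-ranked passage yields a less confident answer.
candidates = [
    Candidate("answer A", mean_score=0.60, context_rank=1),
    Candidate("answer B", mean_score=0.90, context_rank=3),
]

for c in candidates:
    print(c.answer, round(lor_score(c.mean_score, c.context_rank), 4))
print("selected:", best_candidate(candidates).answer)
```

With these illustrative numbers the answer from the third-ranked passage is selected, because the larger weight on Mean Score outweighs the rank bonus of the first passage. This is the kind of case where a purely rank-ordered pipeline would have preferred the first passage.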

Abstract

Retrieval-based question answering systems often suffer from positional bias, leading to suboptimal answer generation. We propose LoRE (Logit-Ranked Retriever Ensemble), a novel approach that improves answer accuracy and relevance by mitigating positional bias. LoRE employs an ensemble of diverse retrievers, such as BM25 and sentence transformers with FAISS indexing. A key innovation is a logit-based answer ranking algorithm that combines the logit scores from a large language model (LLM) with the retrieval ranks of the passages. Experimental results on NarrativeQA and SQuAD demonstrate that LoRE significantly outperforms existing retrieval-based methods in terms of exact match and F1 scores. On SQuAD, LoRE achieves 14.5%, 22.83%, and 14.95% improvements over the baselines for ROUGE-L, EM, and F1, respectively. Qualitatively, LoRE generates more relevant and accurate answers, especially for complex queries.
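The abstract does not say how the results of the individual retrievers are merged into a single ranked list. As one possible reading, the sketch below interleaves two ranked lists round-robin, drops duplicate passages, and assigns each surviving passage a 1-based context rank. The merge strategy, the function name `merge_ranked`, and the passage IDs are all assumptions for illustration, not the paper's method.

```python
from itertools import zip_longest

_MISSING = object()  # sentinel for lists of unequal length


def merge_ranked(*ranked_lists):
    """Round-robin interleave of ranked passage-ID lists with de-duplication.

    Returns a list of (passage_id, context_rank) pairs, ranks starting at 1.
    This merge rule is an assumed stand-in, not taken from the source.
    """
    seen = set()
    merged = []
    # Each tier holds the i-th result from every retriever.
    for tier in zip_longest(*ranked_lists, fillvalue=_MISSING):
        for passage_id in tier:
            if passage_id is _MISSING or passage_id in seen:
                continue
            seen.add(passage_id)
            merged.append(passage_id)
    return [(pid, rank) for rank, pid in enumerate(merged, start=1)]


# Hypothetical passage IDs returned by a sparse and a dense retriever.
bm25_hits = ["p3", "p1", "p7"]
dense_hits = ["p1", "p4", "p3"]

for passage_id, rank in merge_ranked(bm25_hits, dense_hits):
    print(rank, passage_id)
```

The resulting context rank is the quantity that a rank-aware score such as LoR would consume. Other fusion rules (for example, score normalization or reciprocal rank fusion) would fit the same interface.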

Paper Structure

This paper contains 28 sections, 12 equations, 9 figures, 5 tables.

Figures (9)

  • Figure 1: Ensemble of Retrievers
  • Figure 2: Query and Context Interaction using T5 Model
  • Figure 3: Logit Evaluation and Context Rank Integration
  • Figure 4: The one with the highest probability is the correct answer and was also ranked first
  • Figure 5: The one with the highest probability was the correct answer even though it was not ranked first
  • ...and 4 more figures