Enhancing Retrieval in QA Systems with Derived Feature Association
Keyush Shah, Abhishek Goyal, Isaac Wasserman
TL;DR
The paper addresses a weakness of retrieval-augmented generation (RAG): retrieval often fails to surface the implicit information needed for complex QA. It introduces RAIDD, a two-phase framework that derives features (summaries and questions) from ingested documents to guide retrieval while preserving access to the original text. Four variants (RAIDD-S, RAIDD-S ICL, RAIDD-Q, RAIDD-U) show how derived documents can improve context selection and answer accuracy on long-context QA tasks, with improvements of up to 15% in QA accuracy on the LooGLE dataset. The work notes that better retrieval must be matched by capable LLM generation, and points to future directions including richer derived-document types and further tuning of the derivation process to support cross-domain applicability.
Abstract
Retrieval-augmented generation (RAG) has become the standard in long-context question answering (QA) systems. However, typical implementations of RAG rely on a rather naive retrieval mechanism, in which the texts whose embeddings are most similar to that of the query are deemed most relevant. This has consequences in subjective QA tasks, where the most relevant text may not directly contain the answer. In this work, we propose a novel extension to RAG systems, which we call Retrieval from AI Derived Documents (RAIDD). RAIDD leverages the full power of the LLM in the retrieval process by deriving inferred features, such as summaries and example questions, from the documents at ingest. We demonstrate that this approach significantly improves the performance of RAG systems on long-context QA tasks.
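The retrieval idea described above can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes a toy bag-of-words embedding in place of a real embedding model, and a hand-written dictionary of example questions in place of the LLM derivation step. The names `embed`, `cosine`, `build_index`, and `retrieve` are hypothetical. The point it shows is that similarity is computed against the derived text, while the original text is what gets returned as context.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy bag-of-words vector (assumption: stands in for a real embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_index(chunks, derive_fn):
    """Ingest: embed the derived text for each chunk, keep a pointer to the original."""
    return [(embed(derive_fn(chunk)), chunk) for chunk in chunks]


def retrieve(index, query, k=1):
    """Rank by similarity to the derived text, but return the original chunks."""
    q = embed(query)
    ranked = sorted(index, key=lambda entry: cosine(q, entry[0]), reverse=True)
    return [chunk for _, chunk in ranked[:k]]


chunks = [
    "The quarterly report lists revenue by region and product line.",
    "She left the firm in March after the merger talks collapsed.",
]

# Hand-written stand-ins for LLM-derived example questions (assumption:
# in a real system these would be generated by an LLM at ingest).
derived_questions = {
    chunks[0]: "What does the quarterly report contain? How is revenue broken down?",
    chunks[1]: "Why did Maria resign? What led to her departure?",
}

query = "Why did Maria resign?"

# Similarity between the query and the original text of the relevant chunk.
naive_score = cosine(embed(query), embed(chunks[1]))

# Retrieval over derived documents, returning the original text.
raidd_index = build_index(chunks, lambda chunk: derived_questions[chunk])
raidd_hits = retrieve(raidd_index, query, k=1)
```

In this toy setup the query shares no words with the original text of the relevant chunk, so a direct similarity comparison gives it no signal, while the derived question matches the query and points back to that chunk. A real embedding model would capture more than word overlap, so the gap shown here is only illustrative.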
