Structured RAG for Answering Aggregative Questions

Omri Koshorek, Niv Granot, Aviv Alloni, Shahar Admati, Roee Hendel, Ido Weiss, Alan Arazi, Shay-Nitzan Cohen, Yonatan Belinkov

TL;DR

This paper tackles aggregative question answering over large, unstructured private corpora, where answers require reasoning across many documents and aggregating information. It introduces Structured Retrieval Augmented Generation (S-RAG), which ingests data to induce a unified schema and stores records in a database, then translates natural language queries into SQL to retrieve answers with an LLM-driven justification. The authors present two new aggregative QA datasets, Hotels and World Cup, and show that S-RAG, especially with a gold schema, substantially outperforms standard VectorRAG, full-corpus, and deployed systems on these benchmarks and FinanceBench. The work demonstrates the value of structure-aware retrieval for complex, multi-document reasoning and lays groundwork for future research in schema learning and aggregative reasoning over unstructured corpora.

Abstract

Retrieval-Augmented Generation (RAG) has become the dominant approach for answering questions over large corpora. However, current datasets and methods are highly focused on cases where only a small part of the corpus (usually a few paragraphs) is relevant per query, and fail to capture the rich world of aggregative queries. These require gathering information from a large set of documents and reasoning over them. To address this gap, we propose S-RAG, an approach specifically designed for such queries. At ingestion time, S-RAG constructs a structured representation of the corpus; at inference time, it translates natural-language queries into formal queries over said representation. To validate our approach and promote further research in this area, we introduce two new datasets of aggregative queries: HOTELS and WORLD CUP. Experiments with S-RAG on the newly introduced datasets, as well as on a public benchmark, demonstrate that it substantially outperforms both common RAG systems and long-context LLMs.

Paper Structure

This paper contains 35 sections, 1 equation, 3 figures, 5 tables.

Figures (3)

  • Figure 1: S-RAG overview. Ingestion phase (upper): given a small set of questions and documents, the system predicts a schema. Then it predicts a record for each document in the corpus, populating a structured DB. Inference phase (lower): A user query is translated into an SQL query that is run on the database to return an answer.
  • Figure 2: Illustration of a naive CVs corpus, schema and a single record. An example of an aggregate query on such a corpus could be: ‘Which candidates have more than two years of experience?’
  • Figure 3: A randomly selected document from the HOTELS dataset.
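
The two phases described in Figure 1, applied to the CV example of Figure 2, can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the table name `candidates`, its columns, the three records, and the helper functions are all hypothetical, and the SQL string is written by hand where S-RAG would generate it from the user's question. The LLM-based schema and record prediction steps are replaced by hard-coded values.

```python
import sqlite3

# Hypothetical schema for a CV corpus. In S-RAG the schema is predicted at
# ingestion time; here it is hard-coded purely for illustration.
SCHEMA = """
CREATE TABLE candidates (
    name TEXT PRIMARY KEY,
    years_of_experience REAL,
    highest_degree TEXT
)
"""

# Hypothetical records, one per document, standing in for the output of a
# per-document record predictor.
RECORDS = [
    ("Alice", 5.0, "MSc"),
    ("Bob", 1.5, "BSc"),
    ("Carol", 3.0, "PhD"),
]


def build_db(records):
    """Ingestion phase (sketch): create the table and insert one row per document."""
    conn = sqlite3.connect(":memory:")
    conn.execute(SCHEMA)
    conn.executemany("INSERT INTO candidates VALUES (?, ?, ?)", records)
    conn.commit()
    return conn


def candidates_with_experience_over(conn, min_years):
    """Inference phase (sketch): run the SQL a text-to-SQL step might emit for
    'Which candidates have more than two years of experience?'."""
    rows = conn.execute(
        "SELECT name FROM candidates "
        "WHERE years_of_experience > ? ORDER BY name",
        (min_years,),
    ).fetchall()
    return [name for (name,) in rows]


conn = build_db(RECORDS)
result = candidates_with_experience_over(conn, 2)
print(result)  # ['Alice', 'Carol']
```

Because the filter runs over every row in the table, the answer covers the whole corpus rather than only the few passages a top-k retriever would return, which is the property aggregative queries depend on.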