ER-RAG: Enhance RAG with ER-Based Unified Modeling of Heterogeneous Data Sources
Yikuan Xia, Jiazun Chen, Yirui Zhan, Suifeng Zhao, Weipeng Jiang, Chaorui Zhang, Wei Han, Bo Bai, Jun Gao
TL;DR
ER-RAG addresses the challenge of integrating evidence from heterogeneous data sources into RAG by adopting an Entity-Relationship (ER) model that unifies access through two primitives, GET and JOIN. It introduces a two-stage generation pipeline: a source-selection module fine-tuned with Direct Preference Optimization (DPO) chooses the sources, and an API-chain generator then builds retrieval workflows aligned to the source schemas; a post-processing module formats the results. The authors report that ER-RAG achieves competitive performance on the CRAG and CompMix benchmarks, matching commercial RAG pipelines with an 8B backbone and outperforming hybrid approaches by about 3.1% in LLM score while retrieving 5.5x faster. The work offers a practical and extensible framework for cross-source QA that is easier to fine-tune and deploy in real-world, low-resource settings.
Abstract
Large language models (LLMs) excel in question-answering (QA) tasks, and retrieval-augmented generation (RAG) enhances their precision by incorporating external evidence from diverse sources like web pages, databases, and knowledge graphs. However, current RAG methods rely on agent-specific strategies for individual data sources, which poses challenges in low-resource or black-box environments and complicates operations when evidence is fragmented across sources. To address these limitations, we propose ER-RAG, a framework that unifies evidence integration across heterogeneous data sources using the Entity-Relationship (ER) model. ER-RAG standardizes entity retrieval and relationship querying through ER-based APIs with GET and JOIN operations. It employs a two-stage generation process: first, a preference optimization module selects optimal sources; second, another module constructs API chains based on source schemas. This unified approach allows efficient fine-tuning and seamless integration across diverse data sources. ER-RAG demonstrated its effectiveness by winning all three tracks of the 2024 KDDCup CRAG Challenge, achieving performance on par with commercial RAG pipelines using an 8B LLM backbone. It outperformed hybrid competitors by 3.1% in LLM score and accelerated retrieval by 5.5x.
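To make the GET and JOIN operations concrete, the following is a minimal sketch of how an API chain over ER-style sources might look. It is not the authors' implementation: the `SOURCES` dictionary, the `get` and `join` signatures, and the toy movie data are all assumptions made for illustration, since the abstract does not specify the actual API.

```python
from typing import Any

Row = dict[str, Any]

# Hypothetical toy sources: an entity set and a relationship set, both exposed
# as lists of rows so the same two primitives can serve either of them.
SOURCES: dict[str, list[Row]] = {
    "movie": [
        {"movie_id": 1, "title": "Inception", "year": 2010},
        {"movie_id": 2, "title": "Tenet", "year": 2020},
    ],
    "directed_by": [
        {"movie_id": 1, "person": "Christopher Nolan"},
        {"movie_id": 2, "person": "Christopher Nolan"},
    ],
}


def get(entity: str, **conditions: Any) -> list[Row]:
    """GET (assumed form): rows of one entity set matching all equality conditions."""
    return [
        row for row in SOURCES[entity]
        if all(row.get(key) == value for key, value in conditions.items())
    ]


def join(left: list[Row], relation: str, on: str) -> list[Row]:
    """JOIN (assumed form): merge each left row with relation rows sharing the key `on`."""
    return [
        {**l_row, **r_row}
        for l_row in left
        for r_row in SOURCES[relation]
        if on in l_row and on in r_row and l_row[on] == r_row[on]
    ]


# A two-step API chain for the question "Who directed Inception?":
# first retrieve the entity, then follow the relationship.
chain_result = join(get("movie", title="Inception"), "directed_by", on="movie_id")
directors = [row["person"] for row in chain_result]
```

In this sketch a question is answered by composing the two calls, which is the sense in which a generator could emit an "API chain" against a source schema rather than source-specific query code.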
