Graph-Guided Concept Selection for Efficient Retrieval-Augmented Generation
Ziyu Liu, Yijing Liu, Jianfei Yuan, Minzhi Yan, Le Yue, Honghui Xiong, Yi Yang
TL;DR
GraphRAG enhances QA by leveraging a knowledge graph but incurs prohibitive construction costs. The authors propose G2ConS, which combines Core Chunk Selection with an LLM-independent Concept Graph to identify high-value concepts and prune input chunks, enabling dual-path retrieval over both the concept graph and a core-KG with a weighted ensemble. Key contributions include a concept-graph construction that blends semantic relevance and co-occurrence, a robust dual-path retrieval strategy with local and global reranking, and extensive ablation and parameter studies demonstrating favorable cost–performance trade-offs across MuSiQue, HotpotQA, and 2WikiMultihopQA. The results show substantial improvements in QA quality and significant reductions in construction costs, supporting scalable, retrieval-augmented QA in multi-hop and domain-specific settings.
Abstract
Graph-based Retrieval-Augmented Generation (RAG) methods construct a knowledge graph (KG) from text chunks to enhance retrieval in Large Language Model (LLM)-based question answering. They are especially beneficial in domains such as biomedicine, law, and political science, where effective retrieval often involves multi-hop reasoning over proprietary documents. However, these methods demand numerous LLM calls to extract entities and relations from text chunks, incurring prohibitive costs at scale. Through a carefully designed ablation study, we observe that certain words (termed concepts) and their associated documents are more important than others. Based on this insight, we propose Graph-Guided Concept Selection (G2ConS). Its core comprises a chunk selection method and an LLM-independent concept graph. The former selects salient document chunks to reduce KG construction costs; the latter closes knowledge gaps introduced by chunk selection at zero LLM cost. Evaluations on multiple real-world datasets show that G2ConS outperforms all baselines in construction cost, retrieval effectiveness, and answering quality.
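
To make the idea of an LLM-independent concept graph and chunk selection concrete, the following is a minimal sketch and not the G2ConS implementation. It assumes that concepts are simply content words, that edges are weighted by chunk-level co-occurrence only (the semantic-relevance signal mentioned in the TL;DR is omitted), and that a concept's importance can be approximated by its weighted degree. All function names and the stopword list are hypothetical.

```python
from collections import Counter, defaultdict
from itertools import combinations
import re

# Assumption: a tiny stopword list stands in for real concept extraction.
STOPWORDS = {"the", "a", "an", "of", "in", "and", "is", "was", "to", "by", "for", "on"}


def extract_concepts(chunk):
    """Hypothetical concept extractor: lowercase content words longer than 2 letters."""
    words = re.findall(r"[a-z]+", chunk.lower())
    return {w for w in words if w not in STOPWORDS and len(w) > 2}


def build_concept_graph(chunks):
    """Weight each edge by the number of chunks in which two concepts co-occur."""
    edges = Counter()
    for chunk in chunks:
        for u, v in combinations(sorted(extract_concepts(chunk)), 2):
            edges[(u, v)] += 1
    return edges


def concept_scores(edges):
    """Score each concept by weighted degree (an assumed proxy for importance)."""
    scores = defaultdict(int)
    for (u, v), weight in edges.items():
        scores[u] += weight
        scores[v] += weight
    return dict(scores)


def select_core_chunks(chunks, budget):
    """Return the indices of the `budget` chunks with the highest total concept score."""
    scores = concept_scores(build_concept_graph(chunks))
    totals = [sum(scores.get(c, 0) for c in extract_concepts(chunk)) for chunk in chunks]
    # Break ties by original position so the result is deterministic.
    ranked = sorted(range(len(chunks)), key=lambda i: (-totals[i], i))
    return sorted(ranked[:budget])
```

In this sketch, only the chunks returned by `select_core_chunks` would be passed to the costly LLM-based entity and relation extraction, while the co-occurrence graph built over all chunks remains available for retrieval without any LLM calls.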
