LinearRAG: Linear Graph Retrieval Augmented Generation on Large-scale Corpora

Luyao Zhuang, Shengyuan Chen, Yilin Xiao, Huachi Zhou, Yujing Zhang, Hao Chen, Qinggang Zhang, Xiao Huang

TL;DR

LinearRAG introduces Tri-Graph, a relation-free graph built without LLM token consumption, which avoids unstable relation extraction and lets indexing scale linearly with corpus size. Retrieval runs in two stages: local semantic bridging activates relevant entities, then global importance aggregation via Personalized PageRank ranks passages. Results on four benchmarks show LinearRAG outperforming GraphRAG baselines in retrieval quality and generation accuracy while eliminating token costs and reducing indexing time. The framework offers a practical, scalable approach to complex, multi-hop retrieval over real-world data. Source code and datasets are publicly available for reproducibility.

Abstract

Retrieval-Augmented Generation (RAG) is widely used to mitigate hallucinations of Large Language Models (LLMs) by leveraging external knowledge. While effective for simple queries, traditional RAG systems struggle with large-scale, unstructured corpora where information is fragmented. Recent advances incorporate knowledge graphs to capture relational structures, enabling more comprehensive retrieval for complex, multi-hop reasoning tasks. However, existing graph-based RAG (GraphRAG) methods rely on unstable and costly relation extraction for graph construction, often producing noisy graphs with incorrect or inconsistent relations that degrade retrieval quality. In this paper, we revisit the pipeline of existing GraphRAG systems and propose LinearRAG (Linear Graph-based Retrieval-Augmented Generation), an efficient framework that enables reliable graph construction and precise passage retrieval. Specifically, LinearRAG constructs a relation-free hierarchical graph, termed Tri-Graph, using only lightweight entity extraction and semantic linking, avoiding unstable relation modeling. This new paradigm of graph construction scales linearly with corpus size and incurs no extra token consumption, providing an economical and reliable indexing of the original passages. For retrieval, LinearRAG adopts a two-stage strategy: (i) relevant entity activation via local semantic bridging, followed by (ii) passage retrieval through global importance aggregation. Extensive experiments on four datasets demonstrate that LinearRAG significantly outperforms baseline models. Our code and datasets are available at https://github.com/DEEP-PolyU/LinearRAG.
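
The relation-free construction described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`split_sentences`, `extract_entities`, `build_tri_graph`), the regex sentence splitter, and the capitalized-token heuristic standing in for lightweight entity extraction are all assumptions made for illustration. The sketch only shows why a single pass over the corpus, with no relation extraction and no LLM calls, yields the entity-sentence and entity-passage links.

```python
import re
from collections import defaultdict


def split_sentences(passage):
    # Assumption: a naive splitter on ., ! or ? followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", passage) if s.strip()]


def extract_entities(sentence):
    # Assumption: capitalized tokens stand in for a real lightweight
    # entity extractor (e.g. an off-the-shelf NER model).
    return {m.lower() for m in re.findall(r"\b[A-Z][a-zA-Z]+\b", sentence)}


def build_tri_graph(passages):
    """Build entity-sentence and entity-passage links in one pass.

    Each sentence is visited exactly once, so the work grows linearly
    with the total amount of text. No relations between entities are
    extracted and no LLM is called.
    """
    sentences = []                            # sentence id -> sentence text
    entity_to_sentences = defaultdict(set)    # entity -> sentence ids
    entity_to_passages = defaultdict(set)     # entity -> passage ids
    for p_id, passage in enumerate(passages):
        for sentence in split_sentences(passage):
            s_id = len(sentences)
            sentences.append(sentence)
            for entity in extract_entities(sentence):
                entity_to_sentences[entity].add(s_id)
                entity_to_passages[entity].add(p_id)
    return sentences, dict(entity_to_sentences), dict(entity_to_passages)


corpus = [
    "Marie Curie was born in Warsaw. She won the Nobel Prize.",
    "Warsaw is the capital of Poland.",
]
sentences, ent_sent, ent_pass = build_tri_graph(corpus)
```

In this toy corpus the entity `warsaw` links both passages, which is the kind of bridge a multi-hop query needs. The heuristic also picks up noise such as `she`, which a real entity extractor would filter out.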

Paper Structure

This paper contains 28 sections, 7 equations, 5 figures, 7 tables.

Figures (5)

  • Figure 1: Three paradigms of RAG systems.
  • Figure 2: (a) Retrieval and generation performance (%) of vanilla RAG vs. GraphRAG baselines. Notably, the evaluation on the Medical dataset measures GPT-based accuracy, context relevance, and evidence recall across different RAG baselines. (b) Case study of relation errors in knowledge graph construction from local inaccuracy and global inconsistency perspectives.
  • Figure 3: The overall pipeline of the proposed LinearRAG framework. I. Offline Construction. We first construct a Tri-Graph containing entity, sentence, and passage nodes, with edges connecting entities to sentences and entities to passages. II. Online Retrieval. We first activate relevant entities via local semantic bridging on the entity-sentence subgraph while fixing passage nodes. We then use the activated entities to aggregate global importance scores and perform passage retrieval via personalized PageRank on the entity-passage subgraph while fixing sentence nodes.
  • Figure 4: Ablation study on key modules of LinearRAG across four datasets. The y-axis represents the average of GPT-Acc. and Contain-Acc.
  • Figure 5: Parameter analysis of LinearRAG performance on the 2WikiMultiHopQA dataset. (a) shows how performance depends on the threshold $\delta$ in dynamic pruning. (b) examines the effect of the trade-off coefficient $\lambda$.
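
The passage-retrieval step in the Figure 3 caption (personalized PageRank on the entity-passage subgraph) can be sketched as follows. This is a simplified illustration, not the paper's algorithm: the local semantic bridging stage is omitted and the activated entity weights are simply passed in. The function names, the damping factor of 0.85, the fixed iteration count, and the toy graph are all assumptions.

```python
def personalized_pagerank(adjacency, seeds, damping=0.85, iterations=50):
    """Power iteration on an undirected graph with restarts at the seeds.

    adjacency: node -> set of neighbouring nodes
    seeds:     node -> non-negative restart weight
    """
    total = sum(seeds.values())
    if total <= 0:
        raise ValueError("at least one seed needs a positive weight")
    nodes = list(adjacency)
    restart = {n: seeds.get(n, 0.0) / total for n in nodes}
    scores = dict(restart)
    for _ in range(iterations):
        nxt = {n: (1.0 - damping) * restart[n] for n in nodes}
        for n in nodes:
            neighbours = adjacency[n]
            if not neighbours:
                # Dangling node: hand its mass back to the restart distribution.
                for m in nodes:
                    nxt[m] += damping * scores[n] * restart[m]
                continue
            share = damping * scores[n] / len(neighbours)
            for m in neighbours:
                nxt[m] += share
        scores = nxt
    return scores


def rank_passages(entity_to_passages, activated, top_k=3):
    """Rank passages by PageRank mass flowing out of activated entities."""
    adjacency = {}
    for entity, passage_ids in entity_to_passages.items():
        e_node = ("entity", entity)
        adjacency.setdefault(e_node, set())
        for p_id in passage_ids:
            p_node = ("passage", p_id)
            adjacency[e_node].add(p_node)
            adjacency.setdefault(p_node, set()).add(e_node)
    # Entities missing from the graph are dropped from the seed set.
    seeds = {("entity", e): w for e, w in activated.items()
             if ("entity", e) in adjacency}
    scores = personalized_pagerank(adjacency, seeds)
    ranked = [(node[1], s) for node, s in scores.items() if node[0] == "passage"]
    ranked.sort(key=lambda item: item[1], reverse=True)
    return ranked[:top_k]


# Hypothetical entity-passage links and a single activated entity.
links = {"warsaw": {0, 1}, "curie": {0}, "poland": {1}, "paris": {2}}
ranked = rank_passages(links, activated={"curie": 1.0})
```

Here passage 0 ranks first because it contains the activated entity. Passage 1 receives a smaller score through the shared entity `warsaw`, and passage 2 receives none because no path connects it to the seed.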