Mixture-of-PageRanks: Replacing Long-Context with Real-Time, Sparse GraphRAG

Nicholas Alonso, Beren Millidge

TL;DR

The paper tackles the computational burden of frontier long-context LLMs by replacing full-context processing with a retrieval-based approach. It introduces MixPR, a Sparse Mixture of PageRanks retriever that builds a graph over long-context chunks from sparse TF-IDF embeddings and retrieves via a Personalized PageRank scheme. It uses a sparse adjacency matrix and a one-hot personalization vector that emphasizes recent chunks, along with a dynamic alpha that balances local against global retrieval; retrieval runs on CPU with up to 18 iterations. Across 22 tasks on BABILong, RULER, Hash-Hop, and Eng.Sum, MixPR augments multiple LLMs to achieve SOTA or near-SOTA results while significantly reducing compute, enabling on-device, real-time long-context processing. This work demonstrates a practical path to scalable long-context reasoning by decoupling retrieval from expensive attention mechanisms.
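The retrieval scheme summarized above can be illustrated with a small, standard-library-only sketch. It is not the authors' implementation and rests on several assumptions: the function names (`fit_idf`, `embed`, `dot`, `ppr_retrieve`) are hypothetical, sparse vectors and the adjacency are stored as Python dicts rather than in a sparse matrix library, the personalization vector is taken to be the normalized query-to-chunk similarity (not the one-hot vector mentioned above), `alpha` is treated as the restart weight on that vector (the paper's convention may differ), and `alpha` is fixed rather than dynamic. The default of 18 iterations mirrors the figure quoted in the summary.

```python
import math
from collections import Counter


def fit_idf(chunks):
    """Smoothed inverse document frequency for every term seen in the chunks."""
    n = len(chunks)
    df = Counter(term for chunk in chunks for term in set(chunk.lower().split()))
    return {term: math.log((1 + n) / (1 + count)) + 1.0 for term, count in df.items()}


def embed(text, idf):
    """Sparse, L2-normalised TF-IDF vector stored as {term: weight}."""
    tf = Counter(t for t in text.lower().split() if t in idf)
    vec = {term: count * idf[term] for term, count in tf.items()}
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {term: w / norm for term, w in vec.items()} if norm else {}


def dot(u, v):
    """Dot product of two sparse vectors (cosine similarity once normalised)."""
    if len(u) > len(v):
        u, v = v, u
    return sum(w * v.get(term, 0.0) for term, w in u.items())


def ppr_retrieve(chunks, query, alpha=0.5, k=2, iterations=18):
    """Return (top-k chunk indices in chronological order, all PPR scores)."""
    idf = fit_idf(chunks)
    vecs = [embed(c, idf) for c in chunks]
    n = len(chunks)

    # Sparse adjacency: store only nonzero similarities between distinct chunks.
    # (Pairwise loop is for illustration only; it is quadratic in n.)
    adj = [{} for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            s = dot(vecs[i], vecs[j])
            if s > 0.0:
                adj[i][j] = adj[j][i] = s
    out = [sum(row.values()) for row in adj]

    # Assumed personalization vector: query similarity, normalised to sum to 1.
    query_vec = embed(query, idf)
    sims = [dot(query_vec, v) for v in vecs]
    total = sum(sims)
    p = [s / total for s in sims] if total else [1.0 / n] * n

    # Power iteration: pi <- alpha * p + (1 - alpha) * (random-walk step on pi).
    pi = p[:]
    for _ in range(iterations):
        nxt = [alpha * p_i for p_i in p]
        dangling = 0.0
        for j in range(n):
            if out[j] == 0.0:
                dangling += pi[j]  # chunk with no neighbours
                continue
            for i, w in adj[j].items():
                nxt[i] += (1.0 - alpha) * pi[j] * w / out[j]
        # Mass sitting on neighbourless chunks restarts at p.
        pi = [x + (1.0 - alpha) * dangling * p_i for x, p_i in zip(nxt, p)]

    top = sorted(range(n), key=lambda i: (-pi[i], i))[:k]
    return sorted(top), pi


chunks = [
    "mary went to the kitchen",
    "the kitchen is north of the garden",
    "john likes green apples",
    "apples grow on trees",
]
indices, scores = ppr_retrieve(chunks, "where did mary go", alpha=0.5, k=2)
```

In this toy run the query shares a term only with the first chunk, yet the second chunk also receives score because it is linked to the first through the graph; that is the multi-hop behaviour a plain nearest-neighbor lookup would miss. Setting `alpha=1.0` in this sketch reduces the ranking to pure query similarity, while smaller values give more weight to graph structure. The selected indices are returned in document order, in line with the chronological ordering the figure captions describe as necessary for many long-context tasks.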

Abstract

Recent advances have extended the context window of frontier LLMs dramatically, from a few thousand tokens up to millions, enabling entire books and codebases to fit into context. However, the compute costs of running inference with long-context LLMs are massive and often prohibitive in practice. RAG offers an efficient and effective alternative: retrieve and process only the subset of the context most important for the current task. Although promising, recent work applying RAG to long-context tasks has two core limitations: 1) there has been little focus on making the RAG pipeline compute efficient, and 2) such works only test on simple QA tasks, and their performance on more challenging tasks is unclear. To address this, we develop an algorithm based on PageRank, a graph-based retrieval algorithm, which we call mixture-of-PageRanks (MixPR). MixPR uses a mixture of PageRank-based graph-retrieval algorithms implemented using sparse matrices for efficient, cheap retrieval that can deal with a variety of complex tasks. Our MixPR retriever achieves state-of-the-art results across a wide range of long-context benchmark tasks, outperforming existing RAG methods, specialized retrieval architectures, and long-context LLMs despite being far more compute efficient. Because it uses sparse embeddings, our retriever is extremely compute efficient: it can embed and retrieve from millions of tokens within a few seconds, and it runs entirely on CPU.

Paper Structure

This paper contains 10 sections, 3 equations, 11 figures, 3 tables.

Figures (11)

  • Figure 1: Comparison of gpt-4o-mini and RAG systems (@k=100) that use gpt-4o-mini as the generator. (Left) Our MixPR outperforms baseline RAG models and the base LLM on multiple standard long-context benchmarks. (Right) RAG models, like our MixPR, use only a small portion of the input text, drastically reducing the compute cost of LLM inference.
  • Figure 2: RAG Methods for Long-Context Tasks. Previous works (AlonsoMillBabilongBlog; yu2024defense) have shown that chronologically ordering text chunks, rather than rank ordering, is necessary for many long-context tasks. Standard nearest-neighbor RAG (top) retrieves the items most similar to the query (query-relatedness is depicted with red embeddings). Our PageRank-based retrievers represent relations between text chunks using a similarity matrix that can be computed cheaply. PageRank (bottom) retrieves the items that have the highest importance according to the graph structure (structural importance is depicted in blue). Personalized PageRank (middle) balances query-relatedness with structural importance.
  • Figure 3: PPR Retriever Alpha Test. (Left) The recall accuracy of a PPR retriever at k=100, under various alpha settings, for a set of query-dependent, local retrieval tasks. (Right) The recall accuracy of the same PPR retriever at k=100 for query-independent, global retrieval tasks.
  • Figure 4: Performance on multi-hop retrieval tasks. Results from benchmarks on the subset of tasks that involve multi-hop retrieval: BABILong question types 2 and 3, Hash-Hop with 2-6 hash links, and the variable tracing task from RULER. All RAG models were tested with k=100. The non-graph baseline RAG models struggle, while the MixPR model is effective across all tasks, improving performance over the RAG baselines by as much as $50\%$. Note that although variable tracing is intended to require multi-hop retrieval (hsieh2024ruler), the RAG-hybrid model is able to 'shortcut' this task in a single-hop retrieval step. See the experiments section for discussion.
  • Figure 5: Performance on global retrieval tasks. (Left) Performance of various RAG models with k=100, averaged across the cwe and fwe tasks from RULER. (Right) Performance of various RAG models on the Eng.Sum task from infinite-bench. Models were tested with k = 100, 200, 300, 400, and 500.
  • ...and 6 more figures