Mixture-of-PageRanks: Replacing Long-Context with Real-Time, Sparse GraphRAG
Nicholas Alonso, Beren Millidge
TL;DR
The paper tackles the computational burden of frontier long-context LLMs by replacing full-context processing with a retrieval-based approach. It introduces MixPR, a Sparse Mixture of PageRanks retriever that builds a graph over long-context chunks from sparse TF-IDF embeddings and retrieves via a Personalized PageRank scheme. The retriever uses a sparse adjacency matrix and a one-hot personalization vector that emphasizes recent chunks, along with a dynamic alpha that balances local against global retrieval; retrieval runs on CPU with up to 18 iterations. Across 22 tasks on BABILong, RULER, Hash-Hop, and Eng.Sum, MixPR augments multiple LLMs to achieve state-of-the-art or near state-of-the-art results while significantly reducing compute, enabling on-device, real-time long-context processing. The work demonstrates a practical path to scalable long-context reasoning by decoupling retrieval from expensive attention mechanisms.
Abstract
Recent advances have extended the context window of frontier LLMs dramatically, from a few thousand tokens up to millions, enabling entire books and codebases to fit into context. However, the compute costs of running inference with long-context LLMs are massive and often prohibitive in practice. RAG offers an efficient and effective alternative: retrieve and process only the subset of the context most important for the current task. Although promising, recent work applying RAG to long-context tasks has two core limitations: 1) there has been little focus on making the RAG pipeline compute efficient, and 2) such works only test on simple QA tasks, and their performance on more challenging tasks is unclear. To address this, we develop an algorithm based on PageRank, a graph-based retrieval algorithm, which we call mixture-of-PageRanks (MixPR). MixPR uses a mixture of PageRank-based graph-retrieval algorithms implemented using sparse matrices for efficient, cheap retrieval that can deal with a variety of complex tasks. Our MixPR retriever achieves state-of-the-art results across a wide range of long-context benchmark tasks, outperforming existing RAG methods, specialized retrieval architectures, and long-context LLMs despite being far more compute efficient. Because it uses sparse embeddings, our retriever is extremely compute efficient: it can embed and retrieve millions of tokens within a few seconds, and it runs entirely on CPU.
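To make the retrieval idea concrete, the following is a minimal sketch of Personalized PageRank over a chunk graph built from sparse TF-IDF vectors. It is not the authors' implementation, and several choices are assumptions made for illustration:

- whitespace tokenization and a smoothed IDF formula;
- including the query in the IDF corpus;
- a personalization vector taken from normalized query-to-chunk similarity, rather than the one-hot vector described above;
- a fixed `alpha` of 0.5 in place of the dynamic alpha;
- plain dictionaries in place of sparse matrices.

The function names `tfidf_vectors`, `sparse_dot`, and `personalized_pagerank` are hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(texts):
    """One L2-normalized sparse TF-IDF vector (dict: term -> weight) per text."""
    tokenized = [text.lower().split() for text in texts]
    n = len(texts)
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        # Smoothed IDF (an assumption of this sketch, not taken from the paper).
        vec = {
            term: count * (math.log((1 + n) / (1 + doc_freq[term])) + 1.0)
            for term, count in Counter(tokens).items()
        }
        norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
        vectors.append({term: w / norm for term, w in vec.items()})
    return vectors


def sparse_dot(a, b):
    """Dot product of two sparse vectors stored as dicts."""
    if len(a) > len(b):
        a, b = b, a
    return sum(w * b[term] for term, w in a.items() if term in b)


def personalized_pagerank(chunks, query, alpha=0.5, iterations=18):
    """Score chunks with a fixed number of Personalized PageRank power iterations."""
    *chunk_vecs, query_vec = tfidf_vectors(list(chunks) + [query])
    n = len(chunk_vecs)

    # Sparse, column-normalized adjacency: columns[j] maps i -> weight of edge j -> i.
    columns = []
    for j in range(n):
        col = {i: sparse_dot(chunk_vecs[i], chunk_vecs[j]) for i in range(n) if i != j}
        col = {i: w for i, w in col.items() if w > 0.0}
        total = sum(col.values())
        columns.append({i: w / total for i, w in col.items()})

    # Personalization vector: normalized query-to-chunk similarity (uniform fallback).
    sims = [sparse_dot(query_vec, v) for v in chunk_vecs]
    total = sum(sims)
    p = [s / total for s in sims] if total > 0 else [1.0 / n] * n

    scores = list(p)
    for _ in range(iterations):
        spread = [0.0] * n
        dangling = 0.0
        for j, col in enumerate(columns):
            if not col:
                dangling += scores[j]  # node without edges: teleport its mass via p
            for i, w in col.items():
                spread[i] += w * scores[j]
        scores = [
            alpha * spread[i] + (1.0 - alpha + alpha * dangling) * p[i]
            for i in range(n)
        ]
    return scores


chunks = [
    "alice moved to the kitchen",
    "bob picked up the apple",
    "rain fell over paris all day",
    "alice carried the apple to the garden",
]
scores = personalized_pagerank(chunks, "where is the apple")
ranking = sorted(range(len(chunks)), key=lambda i: scores[i], reverse=True)
```

In this toy example, `ranking` lists chunk indices from most to least relevant. The first chunk shares only the word "the" with the query, yet it can gain score through its graph link to the last chunk. This is the kind of indirect relevance that pure keyword matching would miss. The quadratic adjacency construction is for readability only; a practical version would rely on sparse matrix products, as the abstract describes.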
