TeaRAG: A Token-Efficient Agentic Retrieval-Augmented Generation Framework

Chao Zhang, Yuhao Wang, Derong Xu, Haoxin Zhang, Yuanjie Lyu, Yuhao Chen, Shuochen Liu, Tong Xu, Xiangyu Zhao, Yan Gao, Yao Hu, Enhong Chen

TL;DR

TeaRAG addresses token inefficiency in agentic retrieval-augmented generation (RAG) in two ways: it raises the information density of each retrieval through a Knowledge Association Graph (KAG), and it reduces the number of reasoning steps through Iterative Process-aware Direct Preference Optimization (IP-DPO). It builds a large-scale knowledge graph from Wikipedia, fuses semantic chunks with triplets in the KAG, and applies Personalized PageRank (PPR) to filter noise, producing concise yet informative contexts. Training has two stages: supervised fine-tuning (SFT), followed by iterative preference optimization with knowledge-matching rewards. Across six QA benchmarks, this yields gains in Exact Match (EM) together with substantial reductions in output tokens and reasoning steps. The authors also report strong out-of-domain performance, scalability across model sizes, and practical efficiency improvements for real-world RAG deployments.

Abstract

Retrieval-Augmented Generation (RAG) uses external knowledge to improve the reliability of Large Language Models (LLMs). For flexibility, agentic RAG employs autonomous, multi-round retrieval and reasoning to resolve queries. Although recent agentic RAG methods have improved through reinforcement learning, they often incur substantial token overhead from their search and reasoning processes, trading efficiency for accuracy. To address this issue, this work proposes TeaRAG, a token-efficient agentic RAG framework that compresses both the retrieved content and the reasoning steps. 1) First, the retrieved content is compressed by augmenting chunk-based semantic retrieval with graph retrieval over concise triplets. A knowledge association graph is then built from semantic similarity and co-occurrence. Finally, Personalized PageRank is used to highlight key knowledge within this graph, reducing the number of tokens per retrieval. 2) Second, to reduce reasoning steps, Iterative Process-aware Direct Preference Optimization (IP-DPO) is proposed. Specifically, the reward function evaluates knowledge sufficiency through a knowledge matching mechanism while penalizing excessive reasoning steps. This design produces high-quality preference-pair datasets, supporting iterative DPO to improve reasoning conciseness. Across six datasets, TeaRAG improves the average Exact Match by 4% and 2% while reducing output tokens by 61% and 59% on Llama3-8B-Instruct and Qwen2.5-14B-Instruct, respectively. Code is available at https://github.com/Applied-Machine-Learning-Lab/TeaRAG.
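
To make the graph-filtering step more concrete, here is a minimal sketch of Personalized PageRank over a toy association graph. It is not the authors' implementation: the node names, edge weights, damping factor `alpha`, iteration count, and top-k cutoff are all hypothetical, and the graph is assumed to be undirected with every seed node present in it.

```python
from collections import defaultdict


def personalized_pagerank(edges, seeds, alpha=0.85, iters=50):
    """Power-iteration PPR on an undirected weighted graph.

    edges: iterable of (u, v, weight) tuples.
    seeds: dict mapping node -> restart weight (assumed to be graph nodes).
    """
    adj = defaultdict(dict)
    for u, v, w in edges:
        adj[u][v] = adj[u].get(v, 0.0) + w
        adj[v][u] = adj[v].get(u, 0.0) + w
    nodes = list(adj)

    # Normalise the restart distribution so that it sums to 1.
    total = sum(seeds.values())
    restart = {n: seeds.get(n, 0.0) / total for n in nodes}

    rank = dict(restart)
    for _ in range(iters):
        # Teleport back to the seeds with probability (1 - alpha).
        nxt = {n: (1 - alpha) * restart[n] for n in nodes}
        for u in nodes:
            out_weight = sum(adj[u].values())
            for v, w in adj[u].items():
                # Spread u's mass to neighbours in proportion to edge weight.
                nxt[v] += alpha * rank[u] * w / out_weight
        rank = nxt
    return rank


# Hypothetical toy graph: query-to-item edges stand in for semantic
# similarity, chunk-to-triplet edges stand in for co-occurrence.
edges = [
    ("query", "chunk_a", 0.9),
    ("query", "triplet_1", 0.8),
    ("query", "chunk_b", 0.2),
    ("chunk_a", "triplet_1", 1.0),
    ("chunk_b", "triplet_2", 1.0),
]
scores = personalized_pagerank(edges, seeds={"query": 1.0})

# Keep only the two highest-scoring knowledge items as context.
candidates = [n for n in scores if n != "query"]
top = sorted(candidates, key=scores.get, reverse=True)[:2]
print(top)
```

In this toy graph, `triplet_1` is reachable from the query both directly and through `chunk_a`, so it accumulates more mass than `triplet_2`, which is only reachable through the weakly related `chunk_b`. Truncating to the top-ranked items is what shortens the retrieved context.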

Paper Structure

This paper contains 40 sections, 15 equations, 13 figures, 11 tables.

Figures (13)

  • Figure 1: $T_i$ denotes the thinking tokens at the i-th step, $R_i$ denotes the retrieved context at the i-th step, and $O$ represents the final output. (a) illustrates Search-R1, a representative agentic RAG method optimized based on the final outcome. (b) shows our proposed method TeaRAG, which achieves a token-efficient agentic RAG by optimizing the retrieved content length with high-density triplets and controlling the number of LLM reasoning steps via a process-aware reward.
  • Figure 2: (a) shows the token usage. (b) shows the distribution of reasoning steps. (c) shows the F1 performance on single-hop, multi-hop, and overall QA benchmarks.
  • Figure 3: The overall pipeline of TeaRAG. Based on an offline-built knowledge graph and chunk corpus index, TeaRAG progressively constructs a reasoning path until the final answer is determined.
  • Figure 4: (a) shows the structure of a reasoning step. <Reference> and </Reference> are special tokens that wrap retrieved information. (b) explains the symbols used in (c). (c) shows an example of a partial KAG. The answer is highlighted in pink text. The red number on each node is the rank of that content as selected by PPR.
  • Figure 5: The overall training framework for TeaRAG follows a two-stage paradigm. First, we conduct SFT on the preprocessed MuSiQue dataset to help the LLM learn the required format and develop basic reasoning skills. In the second stage, we apply IP-DPO with a process-aware reward to further improve the model while preventing overthinking.
  • ...and 8 more figures
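
The process-aware reward mentioned in the Figure 5 caption can be illustrated with a small sketch. The exact reward formula is not given in this summary, so everything below is an assumption made for illustration: the `Trajectory` fields, the additive combination of terms, and the `step_penalty`, `max_free_steps`, and `margin` values are hypothetical. The sketch only shows the general idea of scoring sampled reasoning paths on answer correctness, knowledge matching, and step count, then keeping the best and worst as a preference pair for DPO.

```python
from dataclasses import dataclass


@dataclass
class Trajectory:
    answer_correct: bool  # did the final answer match the gold answer?
    matched_facts: int    # gold facts found in the retrieved context
    gold_facts: int       # gold facts needed to answer the question
    steps: int            # number of reasoning steps taken


def process_reward(t, step_penalty=0.1, max_free_steps=2):
    """Hypothetical reward: outcome + knowledge sufficiency - step penalty."""
    outcome = 1.0 if t.answer_correct else 0.0
    sufficiency = t.matched_facts / t.gold_facts if t.gold_facts else 0.0
    extra_steps = max(0, t.steps - max_free_steps)
    return outcome + sufficiency - step_penalty * extra_steps


def build_preference_pair(trajectories, margin=0.2):
    """Return (chosen, rejected), or None if the rewards are too close."""
    ranked = sorted(trajectories, key=process_reward, reverse=True)
    chosen, rejected = ranked[0], ranked[-1]
    if process_reward(chosen) - process_reward(rejected) < margin:
        return None
    return chosen, rejected


# Three sampled reasoning paths for one question (made-up values).
concise = Trajectory(answer_correct=True, matched_facts=2, gold_facts=2, steps=2)
verbose = Trajectory(answer_correct=True, matched_facts=2, gold_facts=2, steps=5)
wrong = Trajectory(answer_correct=False, matched_facts=1, gold_facts=2, steps=3)

pair = build_preference_pair([verbose, wrong, concise])
print(pair)
```

Under these assumed weights, the concise and verbose paths both answer correctly, but the verbose one scores lower because of its extra steps. That ordering is the signal that would discourage overthinking during preference optimization.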

Theorems & Definitions (2)

  • Definition 1
  • Definition 2