Tuning LLMs by RAG Principles: Towards LLM-native Memory

Jiale Wei, Shuchi Wu, Ruochen Liu, Xiang Ying, Jingbo Shang, Fangbo Tao

TL;DR

This work investigates memory in LLMs by systematically comparing long-context and retrieval-augmented approaches, finding that long-context models excel at global reasoning while RAG excels at local, retrieval-driven queries. Building on these insights, the authors introduce RAG-Tuned-LLM, a method that synthesizes training data from GraphRAG using entity- and relationship-based templates, then fine-tunes a relatively small LLM via LoRA to encode memory directly in model parameters. Across three datasets, RAG-Tuned-LLM outperforms both standard RAG and long-context baselines on local and global queries, demonstrating strong memory capabilities without external retrieval during inference. The approach bridges open-domain and domain-specific QA, offering a memory-efficient, scalable path for LLM-native reasoning with hierarchical knowledge, and points to future extensions across domains and modalities.

Abstract

Memory, i.e., additional information beyond the training of large language models (LLMs), is crucial to various real-world applications, such as personal assistants. The two mainstream solutions for incorporating memory into the generation process are long-context LLMs and retrieval-augmented generation (RAG). In this paper, we first systematically compare these two types of solutions on three renovated/new datasets and show that (1) long-context solutions, although more expensive, capture the big picture more easily and better answer queries that require considering the memory as a whole; and (2) when the queries concern specific information, RAG solutions are more competitive, especially when the keywords can be explicitly matched. Therefore, we propose a novel method, RAG-Tuned-LLM, which fine-tunes a relatively small (e.g., 7B) LLM using data generated following RAG principles, so that it can combine the advantages of both solutions. Extensive experiments on three datasets demonstrate that RAG-Tuned-LLM can beat long-context LLMs and RAG methods across a wide range of query types.
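
The TL;DR notes that the fine-tuning step uses LoRA, which trains a low-rank update to each adapted weight matrix instead of the full matrix. As a rough illustration of why this keeps fine-tuning of a small model cheap, the sketch below counts the trainable parameters one adapter adds. The hidden size (4096) and rank (8) are hypothetical values chosen for the arithmetic, not settings reported by the paper.

```python
def lora_trainable_params(d_in: int, d_out: int, rank: int) -> int:
    """Parameters added by one LoRA adapter on a (d_out x d_in) weight.

    LoRA learns W + B @ A, where A has shape (rank x d_in) and
    B has shape (d_out x rank), so it adds rank * (d_in + d_out) parameters.
    """
    return rank * (d_in + d_out)


def full_weight_params(d_in: int, d_out: int) -> int:
    """Parameters in the frozen full weight matrix."""
    return d_in * d_out


# Hypothetical sizes for illustration only (not taken from the paper).
hidden_size = 4096
rank = 8

added = lora_trainable_params(hidden_size, hidden_size, rank)
full = full_weight_params(hidden_size, hidden_size)
fraction = added / full

print(f"adapter parameters: {added}")
print(f"full matrix parameters: {full}")
print(f"trainable fraction per adapted matrix: {fraction:.4%}")
```

Under these assumed sizes, the adapter trains well under one percent of the parameters of each matrix it is attached to, while the base weights stay frozen.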

Paper Structure

This paper contains 32 sections, 5 figures, 7 tables.

Figures (5)

  • Figure 1: Overview of our RAG-Tuned-LLM method. Stage 1: RAG provides the foundation for synthesizing training data (query-answer pairs) for fine-tuning. Stage 2: The synthesized data is used to fine-tune a large language model (LLM) via LoRA. Stage 3: Inference is performed exclusively with LLM-native memory, eliminating the need for external memory. The RAG-Tuned-LLM combines the strengths of LLM-native solutions and RAG methods.
  • Figure 2: Overview of the data synthesis process used in RAG-Tuned-LLM. Global data synthesis comprises entity-based and relationship-based data synthesis, which generates query-answer pairs through the integration of templates and LLMs. Local data synthesis generates query-answer pairs using text units enriched by entities and relationships, along with LLMs.
  • Figure 3: The comparison among RAG-Tuned-LLM models trained with different synthesized data types, i.e., local split, global split, and both. We evaluate the models on local and global queries separately to ablate the effect of training data.
  • Figure 4: A concrete example (Case 1) from the News dataset illustrating the superiority of RAG-Tuned-LLM compared to GraphRAG.
  • Figure 5: A concrete example (Case 2) from the News dataset illustrating the superiority of RAG-Tuned-LLM compared to GraphRAG.
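
The global data synthesis described in the Figure 2 caption (templates applied to entities and relationships) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: the template wording, the `Entity` and `Relationship` classes, and the toy records are all hypothetical, and the stored descriptions stand in for the answers that the paper produces with an LLM.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Entity:
    name: str
    description: str


@dataclass(frozen=True)
class Relationship:
    source: str
    target: str
    description: str


# Hypothetical query templates; the paper's actual templates are not shown here.
ENTITY_TEMPLATE = "What do you know about {name}?"
RELATIONSHIP_TEMPLATE = "How are {source} and {target} related?"


def synthesize_global_pairs(
    entities: List[Entity], relationships: List[Relationship]
) -> List[Dict[str, str]]:
    """Build query-answer pairs from graph entities and relationships.

    Assumption: the graph description is used directly as the answer.
    In the described method an LLM is involved in producing the answers.
    """
    pairs = []
    for entity in entities:
        pairs.append({
            "query": ENTITY_TEMPLATE.format(name=entity.name),
            "answer": entity.description,
        })
    for rel in relationships:
        pairs.append({
            "query": RELATIONSHIP_TEMPLATE.format(source=rel.source, target=rel.target),
            "answer": rel.description,
        })
    return pairs


# Toy graph invented for illustration.
toy_entities = [Entity("Alice", "Alice is a reporter covering local elections.")]
toy_relationships = [
    Relationship("Alice", "City Council", "Alice interviewed the City Council.")
]

pairs = synthesize_global_pairs(toy_entities, toy_relationships)
for pair in pairs:
    print(pair["query"], "->", pair["answer"])
```

Pairs in this query-answer shape are what Stage 2 in Figure 1 would consume as fine-tuning data; the local split would add pairs generated from enriched text units rather than from templates.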