Retrieval-Augmented LLM Agents: Learning to Learn from Experience

Thomas Palmeira Ferraz, Romain Deffayet, Vassilina Nikoulina, Hervé Déjean, Stéphane Clinchant

Abstract

While large language models (LLMs) have advanced the development of general-purpose agents, achieving robust generalization to unseen tasks remains a significant challenge. Current approaches typically rely on either fine-tuning or training-free memory-augmented generation using retrieved experience; yet both have limitations: fine-tuning often fails to extrapolate to new tasks, while experience retrieval often underperforms compared to supervised baselines. In this work, we propose to combine these approaches and systematically study how to train retrieval-augmented LLM agents to effectively leverage retrieved trajectories in-context. First, we establish a robust supervised fine-tuning (SFT) recipe using LoRA that outperforms several state-of-the-art agent training pipelines. Second, we provide a detailed analysis of key design choices for experience retrieval, identifying optimal strategies for storage, querying, and trajectory selection. Finally, we propose a pipeline that integrates experience retrieval into the fine-tuning process. Our results demonstrate that this combined approach significantly improves generalization to unseen tasks, providing a scalable and effective framework for building agents that learn to learn from experience.

Paper Structure (73 sections, 2 equations, 7 figures, 16 tables)

This paper contains 73 sections, 2 equations, 7 figures, 16 tables.

Figures (7)

  • Figure 1: ExpRAG Agent overview. ExpRAG augments an LLM agent with retrieval over past experience trajectories. Offline, an experience bank is constructed once by collecting agent rollouts and encoding each trajectory $\tau_i$ into a key embedding $\phi(\tau_i)$, forming an index of trajectories paired with their representations. During inference, the current task description and interaction history $h_t$ are encoded into a query, which is used to retrieve the top-$K$ most relevant trajectories from the bank. The retrieved trajectories are then assembled into a memory block $m_t$ and injected into the system prompt of the LLM policy $\pi_\theta$, which outputs the next action $a_t$. The environment returns a new observation $o_{t+1}$, which is appended to the interaction history. Retrieval may be performed only once at $t=0$ in the static setting, or refreshed throughout the episode in the dynamic setting. The loop is repeated, enabling continual experience-grounded decision making and improved generalization to unseen tasks.
  • Figure 2: Different trajectory formats: chat JSON, agentic JSON, compact JSON, and textual.
  • Figure 3: Longer fine-tuning can improve generalization despite rising validation loss. Comparison of validation loss and inference performance with respect to the number of training epochs on ALFWorld. Blue: validation cross-entropy (left axis; lower is better). Green/orange: rollout success rate and average episode score (right axis; higher is better) evaluated at multiple checkpoints during 50-epoch fine-tuning. Top: easy$\rightarrow$easy (in-distribution). Bottom: easy$\rightarrow$hard (out-of-distribution). For ExpRAG-LoRA, retrieval uses a matched index for each evaluation split.
  • Figure 4: Longer fine-tuning can improve generalization despite rising validation loss. Comparison of validation loss and inference performance with respect to the number of training epochs on ScienceWorld. Blue: validation cross-entropy (left axis; lower is better). Green/orange: rollout success rate and average episode score (right axis; higher is better) evaluated at multiple checkpoints during 50-epoch fine-tuning. Top: easy$\rightarrow$easy (in-distribution). Bottom: easy$\rightarrow$hard (out-of-distribution). For ExpRAG-LoRA, retrieval uses a matched index for each evaluation split.
  • Figure 5: Example of a partial JSON trajectory in ALFWorld.
  • ...and 2 more figures
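
The retrieval loop described in the Figure 1 caption can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: a bag-of-words `embed` function stands in for the encoder $\phi$, and the names `ExperienceBank`, `build_memory_block`, `run_episode`, `policy`, and `env_step` are hypothetical. The `dynamic` flag switches between retrieving once at $t=0$ and refreshing the memory block at every step.

```python
import math
from collections import Counter


def embed(text):
    """Hypothetical stand-in for phi: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[w] * v[w] for w in u if w in v)
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


class ExperienceBank:
    """Offline index of (key embedding, trajectory) pairs."""

    def __init__(self, trajectories):
        # Built once: encode every stored trajectory into its key.
        self.index = [(embed(t), t) for t in trajectories]

    def retrieve(self, query, k):
        """Return the k trajectories most similar to the query."""
        q = embed(query)
        ranked = sorted(self.index, key=lambda kv: cosine(q, kv[0]), reverse=True)
        return [traj for _, traj in ranked[:k]]


def build_memory_block(trajectories):
    """Assemble retrieved trajectories into one block for the system prompt."""
    return "\n\n".join(
        f"[Experience {i + 1}]\n{t}" for i, t in enumerate(trajectories)
    )


def run_episode(task, bank, policy, env_step, k=2, max_steps=5, dynamic=False):
    """Roll out one episode; policy and env_step are caller-supplied callables."""
    history = []
    # Static setting: retrieve once at t = 0 from the task description alone.
    memory = build_memory_block(bank.retrieve(task, k))
    for t in range(max_steps):
        if dynamic and t > 0:
            # Dynamic setting: re-query with task plus interaction history.
            query = task + " " + " ".join(history)
            memory = build_memory_block(bank.retrieve(query, k))
        action = policy(memory, task, history)
        observation, done = env_step(action)
        history.extend([action, observation])
        if done:
            break
    return history
```

In a real agent, `policy` would wrap an LLM call that receives the memory block in its system prompt, and `embed` would be replaced by a learned text encoder; both are left abstract here because the overview above does not specify them.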