Table of Contents
Fetching ...

Fine-tuning with RAG for Improving LLM Learning of New Skills

Humaid Ibrahim, Nikolai Rozanov, Marek Rei

TL;DR

The paper addresses the overhead of retrieval-augmented generation in interactive LLM agents by turning runtime guidance into training-time competence. It introduces a failure-driven one-shot retrieval distillation pipeline that extracts generalizable hints from agent failures, uses them to generate improved teacher trajectories, and distills these into student models with hints removed, enabling internalization of guidance. Across ALFWorld and WebShop, distilled models achieve high success/score (e.g., ALFWorld ~91% with 14B; WebShop ~72.4) while using far fewer tokens than retrieval-based teachers, and they generalize across ReAct/StateAct architectures and model scales. The approach eliminates permanent runtime dependencies on retrieval stores, offering a practical path to more efficient, robust, and scalable interactive agents.

Abstract

Large language model (LLM) agents deployed for multi-step tasks frequently fail in predictable ways: attempting actions with unmet preconditions, issuing redundant commands, or mishandling environment constraints. While retrieval-augmented generation (RAG) can improve performance by providing runtime guidance, it requires maintaining external knowledge databases and adds computational overhead at every deployment. We propose a simple pipeline that converts inference-time retrieval into learned competence through distillation. Our approach: (1) extracts compact, reusable hints from agent failures, (2) uses these hints to generate improved teacher trajectories via one-shot retrieval at episode start, and (3) trains student models on these trajectories with hint strings removed, forcing internalization rather than memorization. Across two interactive benchmarks, ALFWorld (household tasks) and WebShop (online shopping), distilled students consistently outperform baseline agents, achieving up to 91% success on ALFWorld (vs. 79% for baselines) and improving WebShop scores to 72 (vs. 61 for baselines), while using 10-60% fewer tokens than retrieval-augmented teachers depending on the environment. The approach generalizes across model scales (7B/14B parameters) and agent architectures (ReAct/StateAct), demonstrating that retrieval benefits can be effectively internalized through targeted fine-tuning without permanent runtime dependencies.

Fine-tuning with RAG for Improving LLM Learning of New Skills

TL;DR

The paper addresses the overhead of retrieval-augmented generation in interactive LLM agents by turning runtime guidance into training-time competence. It introduces a failure-driven one-shot retrieval distillation pipeline that extracts generalizable hints from agent failures, uses them to generate improved teacher trajectories, and distills these into student models with hints removed, enabling internalization of guidance. Across ALFWorld and WebShop, distilled models achieve high success/score (e.g., ALFWorld ~91% with 14B; WebShop ~72.4) while using far fewer tokens than retrieval-based teachers, and they generalize across ReAct/StateAct architectures and model scales. The approach eliminates permanent runtime dependencies on retrieval stores, offering a practical path to more efficient, robust, and scalable interactive agents.

Abstract

Large language model (LLM) agents deployed for multi-step tasks frequently fail in predictable ways: attempting actions with unmet preconditions, issuing redundant commands, or mishandling environment constraints. While retrieval-augmented generation (RAG) can improve performance by providing runtime guidance, it requires maintaining external knowledge databases and adds computational overhead at every deployment. We propose a simple pipeline that converts inference-time retrieval into learned competence through distillation. Our approach: (1) extracts compact, reusable hints from agent failures, (2) uses these hints to generate improved teacher trajectories via one-shot retrieval at episode start, and (3) trains student models on these trajectories with hint strings removed, forcing internalization rather than memorization. Across two interactive benchmarks, ALFWorld (household tasks) and WebShop (online shopping), distilled students consistently outperform baseline agents, achieving up to 91% success on ALFWorld (vs. 79% for baselines) and improving WebShop scores to 72 (vs. 61 for baselines), while using 10-60% fewer tokens than retrieval-augmented teachers depending on the environment. The approach generalizes across model scales (7B/14B parameters) and agent architectures (ReAct/StateAct), demonstrating that retrieval benefits can be effectively internalized through targeted fine-tuning without permanent runtime dependencies.

Paper Structure

This paper contains 31 sections, 1 equation, 3 figures, 10 tables.

Figures (3)

  • Figure 1: Training Pipeline. Stage A represents the initial base agent run. Stage B is the hint extraction using failures of the agent. Stage C is running the RAG agent. Stage D is training the models based on regular supervised fine-tuning (SFT) and our distillation method.
  • Figure 2: Hint distillation process. We remove the hint block and few-shot examples, since these are constant per task and provide no useful training signal. The end result is a trained adapter. During inference, we combine the adapter with the base agent to get our trained student.
  • Figure 3: Accuracy–efficiency trade-off on ALFWorld and WebShop. The x-axis shows average tokens per episode, including retrieval overhead when used. Shapes denote training regime, Base (circle), RAG (cross), SFT (square), Distillation (diamond). Each point is a method variant (ReAct/StateAct). Shaded ellipses are provided to visually group variants. Note: Act refers to ReAct without "Thought" tokens.