Can Compact Language Models Search Like Agents? Distillation-Guided Policy Optimization for Preserving Agentic RAG Capabilities

Rikuto Kotoge, Mai Nishimura, Jiaxin Ma

TL;DR

This work aims to enable agentic retrieval-augmented generation (RAG) in extremely compact language models, where cold-start instability and exposure bias hinder training. It introduces Distillation-Guided Policy Optimization (DGPO), a two-phase framework that begins with cold-start knowledge distillation (KD) from teacher-generated outputs and then transitions to reinforcement learning (RL) with selective teacher guidance, which stabilizes training and improves agentic behaviors. To diagnose and interpret these behaviors, the authors propose Agentic RAG Capabilities (ARC), a fine-grained suite assessing Thinking, Source Referencing, and Query Rewriting. Empirical results show that DGPO consistently outperforms baselines on seven QA benchmarks, with the compact student sometimes surpassing the teacher, and indicate that DGPO makes agentic RAG practical in compute-constrained settings. The ARC framework shows which components (e.g., multi-hop reasoning, evidence integration) drive the gains, supporting deployment of agentic RAG on lightweight devices.

Abstract

Reinforcement Learning has emerged as a dominant post-training approach to elicit agentic RAG behaviors such as search and planning from language models. Despite its success with larger models, applying RL to compact models (e.g., 0.5–1B parameters) presents unique challenges. The compact models exhibit poor initial performance, resulting in sparse rewards and unstable training. To overcome these difficulties, we propose Distillation-Guided Policy Optimization (DGPO), which employs cold-start initialization from teacher demonstrations and continuous teacher guidance during policy optimization. To understand how compact models preserve agentic behavior, we introduce Agentic RAG Capabilities (ARC), a fine-grained metric analyzing reasoning, search coordination, and response synthesis. Comprehensive experiments demonstrate that DGPO enables compact models to achieve sophisticated agentic search behaviors, even outperforming the larger teacher model in some cases. DGPO makes agentic RAG feasible in compute-constrained environments.
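
To make the idea of teacher guidance during policy optimization concrete, here is a minimal sketch of one way a selective guidance signal could be computed. It is an illustration only, not the authors' implementation: the function names `token_kl` and `guided_reward`, the weight `beta`, the reward value of 1.0 for a correct answer, and the direction of the KL term are all assumptions that this summary does not specify.

```python
import math
from typing import Sequence


def token_kl(student: Sequence[float], teacher: Sequence[float],
             eps: float = 1e-12) -> float:
    """KL(student || teacher) for one token's probability distribution.

    The KL direction is an assumption made for this sketch.
    """
    return sum(
        p * math.log((p + eps) / (q + eps))
        for p, q in zip(student, teacher)
        if p > 0.0
    )


def guided_reward(answer_correct: bool,
                  student_dists: Sequence[Sequence[float]],
                  teacher_dists: Sequence[Sequence[float]],
                  beta: float = 0.1) -> float:
    """Hypothetical selective guidance for a single rollout.

    A correct rollout keeps the plain task reward (assumed to be 1.0).
    An incorrect rollout is instead scored by how far the student's
    per-token distributions drift from the teacher's, so the policy
    still receives a dense signal when the task reward would be zero.
    """
    if answer_correct:
        return 1.0
    if not student_dists:
        return 0.0
    mean_kl = sum(
        token_kl(s, t) for s, t in zip(student_dists, teacher_dists)
    ) / len(student_dists)
    return -beta * mean_kl
```

Under these assumptions, a failed rollout that already matches the teacher receives a penalty of zero, while one that diverges receives a negative score proportional to the divergence. This is one way to avoid the sparse-reward problem the abstract describes for compact models.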

Paper Structure

This paper contains 55 sections, 7 equations, 6 figures, 12 tables.

Figures (6)

  • Figure 1: Distillation-Guided Policy Optimization. Top: Compact models struggle to earn rewards due to poor capability, which leads to training collapse. Bottom: DGPO establishes a stable reward mechanism by guiding incorrect answers through teacher mimicry.
  • Figure 2: Agentic RAG Capabilities. We introduce Agentic RAG Capabilities (ARC), which characterizes the core capabilities of LLMs required for agentic RAG systems. ARC is evaluated as three primary components: thinking, query rewriting, and source referencing.
  • Figure 3: Comparison of prompt-based and RL-based (PPO) post-training agentic RAG across model sizes.
  • Figure 4: Top: Standard PPO pipeline for post-training LLMs. The reference LLM serves as a regularization anchor to prevent excessive deviation from the initial policy. Bottom: Our proposed distillation-guided PPO pipeline. Unlike conventional approaches where the reference model merely constrains policy drift, our framework employs the teacher model to actively guide the student toward correct behaviors when autonomous attempts fail, transforming the reference's role from passive regularization to active pedagogical guidance.
  • Figure 5: Training curve of PPO and GRPO.
  • ...and 1 more figure