More Than One Teacher: Adaptive Multi-Guidance Policy Optimization for Diverse Exploration

Xiaoyang Yuan, Yujuan Ding, Yi Bin, Wenqi Shao, Jinyu Cai, Jingkuan Song, Yang Yang, Heng Tao Shen

TL;DR

This paper introduces Adaptive Multi-Guidance Policy Optimization (AMPO), a mixed-policy RL framework that draws on multiple teacher models only when the on-policy student fails. It targets the limited exploration of RLVR methods that rely on self-exploration or a single off-policy teacher. A comprehension-based selection mechanism picks the external reasoning paths the student is most likely to understand, balancing exploration and exploitation. Empirically, AMPO outperforms GRPO by 4.3% on six in-distribution math tasks and by 12.2% on three out-of-distribution benchmarks, improves Pass@k and exploration diversity, and retains its gains when using smaller models or different model families. With four peer-sized teachers, it achieves results comparable to approaches that use a single, more powerful teacher (e.g., DeepSeek-R1) with more data. The findings suggest that multi-teacher guidance, combined with on-demand intervention and comprehension-aware selection, offers a scalable path to better reasoning and generalization in LLMs.

Abstract

Reinforcement Learning with Verifiable Rewards (RLVR) is a promising paradigm for enhancing the reasoning ability in Large Language Models (LLMs). However, prevailing methods primarily rely on self-exploration or a single off-policy teacher to elicit long chain-of-thought (LongCoT) reasoning, which may introduce intrinsic model biases and restrict exploration, ultimately limiting reasoning diversity and performance. Drawing inspiration from multi-teacher strategies in knowledge distillation, we introduce Adaptive Multi-Guidance Policy Optimization (AMPO), a novel framework that adaptively leverages guidance from multiple proficient teacher models, but only when the on-policy model fails to generate correct solutions. This "guidance-on-demand" approach expands exploration while preserving the value of self-discovery. Moreover, AMPO incorporates a comprehension-based selection mechanism, prompting the student to learn from the reasoning paths that it is most likely to comprehend, thus balancing broad exploration with effective exploitation. Extensive experiments show AMPO substantially outperforms a strong baseline (GRPO), with a 4.3% improvement on mathematical reasoning tasks and 12.2% on out-of-distribution tasks, while significantly boosting Pass@k performance and enabling more diverse exploration. Notably, using four peer-sized teachers, our method achieves comparable results to approaches that leverage a single, more powerful teacher (e.g., DeepSeek-R1) with more data. These results demonstrate a more efficient and scalable path to superior reasoning and generalizability. Our code is available at https://github.com/SII-Enigma/AMPO.

Paper Structure

This paper contains 20 sections, 54 equations, 6 figures, 7 tables.

Figures (6)

  • Figure 1: The AMPO training framework. It enhances exploration by adaptively replacing on-policy failures with external solutions from a Multi-Guidance Pool only when sparse rewards occur. The selection of external guidance is prioritized based on the Policy Model's comprehension score for each option, ensuring effective learning.
  • Figure 2: Statistics of the average response length for different methods on the in-distribution dataset based on Qwen2.5-7B-Ins.
  • Figure 3: Pass@k performance of different RL algorithms across several reasoning benchmarks.
  • Figure 4: Training dynamics of rewards, response lengths, gradient norm, and training entropy during GRPO and AMPO training.
  • Figure 5: Training dynamics of rewards, response lengths, gradient norm, and training entropy during AMPO training with different $k_0$.
  • ...and 1 more figure