
Bootstrapping Exploration with Group-Level Natural Language Feedback in Reinforcement Learning

Lei Huang, Xiang Cheng, Chenxiao Zhao, Guobin Shen, Junjie Yang, Xiaocheng Feng, Yuxuan Gu, Xing Yu, Bing Qin

TL;DR

This paper proposes GOLF, an RL framework that explicitly exploits group-level language feedback to guide targeted exploration through actionable refinements. GOLF jointly optimizes generation and refinement within a unified RL loop, creating a virtuous cycle that continuously improves both capabilities.

Abstract

Large language models (LLMs) typically receive diverse natural language (NL) feedback through interaction with the environment. However, current reinforcement learning (RL) algorithms rely solely on scalar rewards, leaving the rich information in NL feedback underutilized and leading to inefficient exploration. In this work, we propose GOLF, an RL framework that explicitly exploits group-level language feedback to guide targeted exploration through actionable refinements. GOLF aggregates two complementary feedback sources: (i) external critiques that pinpoint errors or propose targeted fixes, and (ii) intra-group attempts that supply alternative partial ideas and diverse failure patterns. This group-level feedback is used to produce high-quality refinements, which are adaptively injected into training as off-policy scaffolds to provide targeted guidance in sparse-reward regions. Meanwhile, GOLF jointly optimizes generation and refinement within a unified RL loop, creating a virtuous cycle that continuously improves both capabilities. Experiments on both verifiable and non-verifiable benchmarks show that GOLF achieves superior performance and exploration efficiency, with a 2.2$\times$ improvement in sample efficiency over RL methods trained solely on scalar rewards. Code is available at https://github.com/LuckyyySTA/GOLF.
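
The loop described in the abstract can be illustrated with a minimal Python sketch. It is not taken from the GOLF repository: the `Sample` type, the function names, the `low_reward` trigger, the `max_inject` cap, and the choice to replace the weakest rollouts are all hypothetical, and the group-normalized advantage is the common GRPO-style form rather than a confirmed detail of the paper.

```python
from dataclasses import dataclass
from statistics import mean, pstdev
from typing import List


@dataclass
class Sample:
    text: str
    reward: float
    off_policy: bool = False  # True only for injected refinements


def build_group_feedback(group: List[Sample], critiques: List[str],
                         fail_below: float) -> str:
    """Join external critiques with the failed attempts from the same group.

    Assumption: feedback is a plain concatenation; the paper may format it differently.
    """
    failed = [s.text for s in group if s.reward < fail_below]
    parts = [f"Critique: {c}" for c in critiques]
    parts += [f"Failed attempt: {t}" for t in failed]
    return "\n".join(parts)


def inject_refinements(group: List[Sample], refinements: List[Sample],
                       low_reward: float = 0.5, max_inject: int = 1) -> List[Sample]:
    """Swap the weakest rollouts for good refinements when the group is low-reward.

    Assumption: "adaptive" means "only when the mean group reward is below a threshold".
    """
    if mean([s.reward for s in group]) >= low_reward:
        return list(group)  # enough reward signal already; keep the group on-policy
    limit = min(max_inject, len(group))
    good = sorted((r for r in refinements if r.reward >= low_reward),
                  key=lambda s: s.reward, reverse=True)[:limit]
    if not good:
        return list(group)
    # Keep the best on-policy rollouts and append the refinements as off-policy samples.
    kept = sorted(group, key=lambda s: s.reward, reverse=True)[:len(group) - len(good)]
    return kept + [Sample(r.text, r.reward, off_policy=True) for r in good]


def group_advantages(group: List[Sample], eps: float = 1e-6) -> List[float]:
    """GRPO-style advantage: reward standardized within the group."""
    rewards = [s.reward for s in group]
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]
```

Under these assumptions, a group in which every rollout fails has identical rewards and therefore all-zero advantages, so it contributes no learning signal. Injecting a single successful refinement makes the rewards differ and gives the refinement a positive advantage, which is one way to read "targeted guidance in sparse-reward regions".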

Paper Structure (58 sections, 9 equations, 8 figures, 12 tables)

Figures (8)

  • Figure 1: An illustration of RL with natural language feedback. Compared with scalar-reward RL (top), aggregating intra-group and external feedback turns sparse outcomes into actionable refinement signals, enabling guided exploration (bottom).
  • Figure 2: An overview of GOLF, which consists of three components. The policy first rolls out a group of candidates and receives both scalar rewards and external critiques. GOLF then aggregates the critiques with the failed trajectories in the same group to form group-level NL feedback, which conditions a refinement stage to produce improved responses. Finally, high-quality refinements are adaptively injected back into the rollout group as off-policy guidance, mitigating low-reward regimes. Both generation and refinement are optimized jointly within a unified RL loop.
  • Figure 3: Evaluation performance over training steps. We report the LC win rate on AlpacaEval v2.0 (left), WildBench score (middle), and ArenaHard v2.0 win rate (right). The baseline refers to Pairwise-GRPO, which uses the same generative reward model as GOLF.
  • Figure 4: Pass@$k$ comparison between GRPO and GOLF on mathematical reasoning benchmarks using Qwen-3-8B.
  • Figure 5: Ablation on feedback sources. We ablate intra-group attempts or external critiques from the aggregated refinement context. Bars report average performance over the non-verifiable, math reasoning, and instruction following suites; per-benchmark results are provided in the appendix.
  • ...and 3 more figures
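
For readers unfamiliar with the Pass@$k$ metric in Figure 4, the sketch below shows the commonly used unbiased estimator: given $n$ sampled solutions of which $c$ are correct, it estimates the probability that at least one of $k$ samples is correct as $1 - \binom{n-c}{k} / \binom{n}{k}$. Whether the paper uses exactly this estimator is an assumption, and the function name `pass_at_k` is illustrative.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k from n samples with c correct ones."""
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    if n - c < k:
        return 1.0  # fewer than k incorrect samples: every k-subset has a correct one
    # Probability that a random k-subset contains no correct sample, subtracted from 1.
    return 1.0 - comb(n - c, k) / comb(n, k)
```

For example, with 10 samples of which 5 are correct, `pass_at_k(10, 5, 1)` is 0.5, and the estimate rises as `k` increases.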