Enhancing Reinforcement Learning Fine-Tuning with an Online Refiner

Hao Ma, Zhiqiang Pu, Yang Liu, Xiaolin Ai

Abstract

Constraints are essential for stabilizing reinforcement learning fine-tuning (RFT) and preventing degenerate outputs, yet they inherently conflict with the optimization objective because stronger constraints limit the ability of a fine-tuned model to discover better solutions. We propose dynamic constraints that resolve this tension by adapting to the evolving capabilities of the fine-tuned model, based on the insight that constraints should only intervene when degenerate outputs occur. We implement this by using a reference model as an online refiner that takes the response from the fine-tuned model and generates a minimally corrected version that preserves correct content verbatim while fixing errors. A supervised fine-tuning loss then trains the fine-tuned model to produce the refined output. This mechanism yields a constraint that automatically strengthens or relaxes based on output quality. Experiments on dialogue and code generation show that dynamic constraints outperform both KL regularization and unconstrained baselines, achieving substantially higher task rewards while maintaining training stability.

Paper Structure (21 sections, 9 equations, 11 figures, 4 tables)

Figures (11)

  • Figure 1: An illustration of the insight behind the dynamic constraint. (a) The left panel shows conventional KL regularization, which constrains the fine-tuned policy $\pi_{\theta}$ to remain close to the reference policy $\pi_{0}$. However, when the optimal policy lies far from $\pi_{0}$ and requires $\pi_{\theta}$ to deviate significantly, the KL regularization becomes an obstacle to policy optimization. (b) The right panel presents the dynamic constraint, which is derived from $\pi_{0}$'s refinement of $\pi_{\theta}$'s response given the context. When $\pi_{\theta}$ deviates substantially from $\pi_{0}$, the dynamic constraint not only avoids hindering policy optimization but also provides effective guidance and correction for $\pi_{\theta}$.
  • Figure 2: The pipeline for calculating a dynamic constraint. After $\pi_\theta$ generates a response $a$ for the query $s_0$, a refiner LLM refines the response based on the necessary context (query $s_0$, response $a$) and a predefined template. The refinement process is conservative: in most cases the refiner simply repeats $a$ when no obvious issues are detected, and otherwise it makes only minimal edits to $a$. The detailed template is provided in the appendix. The dynamic constraint is computed as a cross entropy loss that treats the refined response $a'$ as the ground truth, which is equivalent to the SFT loss on the pair $\langle s_0, a' \rangle$.
  • Figure 3: Dynamic constraint from a dataset perspective. The dynamic constraint can be interpreted as an RFT-SFT hybrid training approach with a dynamically updated SFT dataset.
  • Figure 4: Training dynamics on Prompt-Collection-v0.1 (top) and APPS (bottom). (a, d) Dynamic (orange) achieves continuous reward growth, whereas Static (blue) saturates and w/o constraint (red) collapses. (b, e) The KL divergence of Dynamic rises steadily, indicating deep exploration beyond the initial policy $\pi_0$, while Static remains tethered. (c, f) The cross entropy remains low, confirming that the Refiner $\pi_{\text{refiner}}$ successfully tracks the evolving policy $\pi_\theta$.
  • Figure 5: Training curves compared with DAPO on the APPS dataset. The right figure shows the proportion of rollouts improved by the Refiner LLM.
  • ...and 6 more figures
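
The Figure 2 caption describes the dynamic constraint as a cross entropy loss with the refined response $a'$ as the target. The snippet below is a minimal sketch of that computation, not the authors' implementation. The function names, the toy logits, the three-token vocabulary, and the choice to average over tokens (rather than sum) are all assumptions made for illustration.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def dynamic_constraint_loss(step_logits, refined_tokens):
    """Mean token-level cross entropy of the refined response a' under pi_theta.

    step_logits[t] is assumed to hold pi_theta's logits for position t of a',
    conditioned on the query s_0 and the preceding tokens of a' (teacher forcing).
    """
    assert len(step_logits) == len(refined_tokens)
    nll = 0.0
    for logits, token in zip(step_logits, refined_tokens):
        nll -= log_softmax(logits)[token]
    return nll / len(refined_tokens)


# Hypothetical toy policy: at each of 3 positions it strongly prefers one token.
# The two targets below differ only in the last token, so they share the same
# prefix and therefore the same per-position logits.
policy_logits = [[4.0, 0.0, 0.0], [0.0, 4.0, 0.0], [0.0, 0.0, 4.0]]

response = [0, 1, 2]         # a: what pi_theta itself would produce
refined_same = [0, 1, 2]     # a' when the refiner simply repeats a
refined_edited = [0, 1, 0]   # a' when the refiner corrects the last token

loss_same = dynamic_constraint_loss(policy_logits, refined_same)
loss_edited = dynamic_constraint_loss(policy_logits, refined_edited)
print(f"refiner repeats a: {loss_same:.4f}")
print(f"refiner edits a:   {loss_edited:.4f}")
```

In this toy setting the loss stays small when the refiner repeats the response and grows when the refiner edits it, which mirrors the stated behavior of a constraint that strengthens only when a correction is needed.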

Theorems & Definitions (2)

  • Definition 1
  • Definition 2