GIFT: Group-relative Implicit Fine Tuning Integrates GRPO with DPO and UNA

Zhichao Wang

TL;DR

GIFT introduces a unified, on-policy fine-tuning framework that merges GRPO, DPO, and UNA to align LLMs. By normalizing explicit and implicit rewards and minimizing their mean-squared error, it converts a non-convex reward-maximization problem into a convex objective with low-variance gradients, while preserving exploration. The method computes an explicit reward from a reward model and an implicit reward from an LLM policy via log-density ratios, then aligns them through group normalization and MSE loss: $L_{\text{GIFT-reward}}(\pi_\theta) = \mathbb{E}_{(x,y)\sim D} [ ( r'_{\phi}(x,y) - \beta\, \hat{r}'_{\theta}(x,y) )^2 ]$. Empirical results on GSM8K and MATH with 7B and 32B models show faster convergence, reduced overfitting, and improved reasoning performance relative to GRPO and other baselines, demonstrating practical benefits for scalable, robust alignment.
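
The cancellation behind the group normalization can be written out in a few lines. This is a sketch under an assumption not stated in this summary: that the implicit reward follows the standard DPO closed form, in which the true reward equals a scaled log-density ratio plus a prompt-dependent term $\beta \log Z(x)$. The symbols $Z(x)$, $\mu_x$, $\sigma_x$, and the group size $N$ are introduced here for illustration only.

$$
\begin{aligned}
r(x, y_i) &= \beta\, \hat{r}_{\theta}(x, y_i) + \beta \log Z(x), \qquad \hat{r}_{\theta}(x, y_i) = \log \frac{\pi_\theta(y_i \mid x)}{\pi_{\text{ref}}(y_i \mid x)}, \quad i = 1, \dots, N, \\
\mu_x &= \frac{1}{N} \sum_{j=1}^{N} \hat{r}_{\theta}(x, y_j), \qquad \sigma_x^2 = \frac{1}{N} \sum_{j=1}^{N} \big( \hat{r}_{\theta}(x, y_j) - \mu_x \big)^2, \\
\hat{r}'_{\theta}(x, y_i) &= \frac{\hat{r}_{\theta}(x, y_i) - \mu_x}{\sigma_x}
= \frac{r(x, y_i) - \frac{1}{N} \sum_{j} r(x, y_j)}{\beta\, \sigma_x}.
\end{aligned}
$$

Because $\beta \log Z(x)$ is the same for every response to a given prompt, it shifts all $N$ rewards in the group equally and drops out when the group mean is subtracted. The normalized implicit reward $\hat{r}'_{\theta}$ can therefore be computed from log-density ratios alone, which is what allows it to be compared with the normalized explicit reward $r'_{\phi}$ in the MSE loss above.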

Abstract

I propose Group-relative Implicit Fine Tuning (GIFT), a novel reinforcement learning framework for aligning LLMs. Instead of directly maximizing cumulative rewards like PPO or GRPO, GIFT minimizes the discrepancy between implicit and explicit reward models. It combines three key ideas: (1) the online multi-response generation and normalization of GRPO, (2) the implicit reward formulation of DPO, and (3) the implicit-explicit reward alignment principle of UNA. By jointly normalizing the implicit and explicit rewards, GIFT eliminates an otherwise intractable term that prevents effective use of implicit rewards. This normalization transforms the complex reward maximization objective into a simple mean squared error (MSE) loss between the normalized reward functions, converting a non-convex optimization problem into a convex, stable, and analytically differentiable formulation. Unlike offline methods such as DPO and UNA, GIFT remains on-policy and thus retains exploration capability. Compared to GRPO, it requires fewer hyperparameters, converges faster, and generalizes better with significantly reduced training overfitting. Empirically, GIFT achieves superior reasoning and alignment performance on mathematical benchmarks while remaining computationally efficient.
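
A minimal sketch of the loss described above, for one prompt and its group of sampled responses. It is an illustration under stated assumptions, not the paper's implementation: the function names, the `eps` stabilizer, the use of the population standard deviation, and the toy numbers are all hypothetical, and the code computes only the forward value of the loss (a real training loop would use an autodiff framework and backpropagate through the policy log-probabilities).

```python
import math
from statistics import fmean, pstdev


def group_normalize(values, eps=1e-8):
    """Z-score per-response scalars within one prompt's group.

    eps is an assumed stabilizer for groups whose values are all equal.
    """
    mean = fmean(values)
    std = pstdev(values)
    return [(v - mean) / (std + eps) for v in values]


def implicit_rewards(logp_policy, logp_ref):
    """Sequence-level log-density ratios: log pi_theta(y|x) - log pi_ref(y|x)."""
    return [lp - lr for lp, lr in zip(logp_policy, logp_ref)]


def gift_loss(explicit, logp_policy, logp_ref, beta=1.0):
    """MSE between normalized explicit rewards and beta * normalized implicit rewards."""
    r_exp = group_normalize(explicit)
    r_imp = group_normalize(implicit_rewards(logp_policy, logp_ref))
    return fmean([(e - beta * i) ** 2 for e, i in zip(r_exp, r_imp)])


if __name__ == "__main__":
    # Hypothetical group of four responses to a single prompt.
    explicit = [1.0, 0.0, 0.0, 1.0]          # e.g. correctness scores
    logp_ref = [-12.0, -15.0, -11.0, -14.0]  # reference-model log-probs
    logp_pol = [-11.8, -15.2, -11.2, -13.8]  # current-policy log-probs

    loss = gift_loss(explicit, logp_pol, logp_ref)

    # Adding a constant to every implicit reward (standing in for a
    # prompt-only term) leaves the loss unchanged after normalization.
    shifted = gift_loss(explicit, [lp + 3.7 for lp in logp_pol], logp_ref)
    print(loss, shifted, math.isclose(loss, shifted, abs_tol=1e-9))
```

The second call in the demo illustrates the point made in the abstract: any term that depends only on the prompt is removed by the group normalization, so it never needs to be evaluated.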

Paper Structure

This paper contains 15 sections, 14 equations, 4 figures, 1 table, 1 algorithm.

Figures (4)

  • Figure 1: Comparison of different optimization methods. (a) DPO: an offline method in which a prompt, a desired response, and an undesired response are provided to the LLM policy to generate implicit rewards, and a BCE loss is used to optimize the policy. (b) UNA: an offline method in which a prompt and a response are provided to the LLM policy to generate an implicit reward, and the policy is optimized by minimizing the difference between the implicit and explicit rewards. (c) PPO: an online method in which the policy generates a response for a prompt, the explicit reward model scores it, and the LLM is optimized through policy gradient. (d) GRPO: an online method in which the policy generates multiple responses for a prompt, the explicit reward model scores each response, and the normalized rewards are used to optimize the LLM through policy gradient. (e) GIFT: an online method in which the policy generates multiple responses for a prompt, an implicit and an explicit reward are computed for each response, and the LLM is optimized by minimizing the MSE between the normalized implicit and explicit rewards.
  • Figure 2: (a) Impact of rollout numbers ($N=1,2,4,8,16,32$) during fine-tuning; (b) Comparison of implicit reward definitions: summation (kl_sum) vs. averaging (kl_average).
  • Figure 3: Comparison of GIFT and GRPO on DeepSeek-7B using the GSM8K and MATH datasets. Training and evaluation curves show that GRPO overfits more strongly than GIFT.
  • Figure 4: Comparison of GIFT and GRPO on Qwen2.5-32B using GSM8K and MATH datasets. GIFT achieves faster convergence and better generalization.
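
The two implicit reward definitions named in the Figure 2(b) caption can be illustrated with a short sketch. The mode names `kl_sum` and `kl_average` come from the caption; their exact definitions are not given in this summary, so the reading below (sum versus mean of per-token log-ratios over the response) is an assumption, and the function name and inputs are hypothetical.

```python
def implicit_reward(token_logp_policy, token_logp_ref, mode="kl_sum"):
    """Reduce per-token log-ratios to one sequence-level implicit reward.

    Assumed reading of the caption:
      kl_sum     -> sum of log pi_theta - log pi_ref over response tokens
      kl_average -> the same sum divided by the number of response tokens
    """
    ratios = [p - r for p, r in zip(token_logp_policy, token_logp_ref)]
    total = sum(ratios)
    if mode == "kl_sum":
        return total
    if mode == "kl_average":
        return total / len(ratios)
    raise ValueError(f"unknown mode: {mode}")


if __name__ == "__main__":
    # Hypothetical two-token response.
    policy = [-1.0, -2.0]
    ref = [-1.5, -2.5]
    print(implicit_reward(policy, ref, "kl_sum"))
    print(implicit_reward(policy, ref, "kl_average"))
```

Under this reading, summation lets longer responses accumulate larger magnitudes, while averaging makes the implicit reward length-independent before group normalization is applied.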