GIFT: Guided Importance-Aware Fine-Tuning for Diffusion Language Models

Guowei Xu, Wenxin Xu, Jiawang Zhao, Kaisheng Ma

TL;DR

Diffusion language models enable parallel sequence refinement but pose challenges for supervised fine-tuning due to uncertain token-level probabilities. The authors introduce GIFT, an entropy-guided, importance-aware fine-tuning method that uses token-wise entropy to assign per-token masking rates and weights, yielding a diffusion-consistent, theoretically grounded loss. Empirical results across 1k–10k data scales, with LoRA and full-parameter fine-tuning on base and instruct models, show consistent improvements over standard SFT on four reasoning benchmarks (Sudoku, Countdown, GSM8K, MATH-500) while maintaining or improving time efficiency. The approach underscores the value of entropy-based token prioritization for training stability and effectiveness in diffusion-based language models. Limitations include the derivation's reliance on a fixed diffusion generator $Q$ and an evaluation limited to certain datasets and model scales, which suggests broader validation and dynamic adaptation of $Q$ as future work.

Abstract

Diffusion models have recently shown strong potential in language modeling, offering faster generation compared to traditional autoregressive approaches. However, applying supervised fine-tuning (SFT) to diffusion models remains challenging, as they lack precise probability estimates at each denoising step. While the diffusion mechanism enables the model to reason over entire sequences, it also makes the generation process less predictable and often inconsistent. This highlights the importance of controlling key tokens that guide the direction of generation. To address this issue, we propose GIFT, an importance-aware fine-tuning method for diffusion language models, where tokens are assigned different importance weights based on their entropy. Derived from diffusion theory, GIFT delivers substantial gains: across diverse settings, including mainstream training datasets ranging from 1k to 10k examples, LoRA or full-parameter fine-tuning, and training on base or instruct models, GIFT consistently achieves superior overall performance compared to standard SFT on four widely used reasoning benchmarks (Sudoku, Countdown, GSM8K, and MATH-500).

Paper Structure

This paper contains 29 sections, 4 theorems, 43 equations, 3 figures, 8 tables, 2 algorithms.

Key Result

Theorem 1

Assuming the $Q$ matrix takes the form given in Equation eq:q, let the initial sequence be $x_0$ and the sequence at time $t$ be $x_t$. Under this setting, the $i$-th token is masked with probability $t_i = 1 - (1-t)^{\tfrac{\beta_{x^i}}{\beta_{\text{ref}}}}$, where $\beta_{x^i}$ denotes the masking
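
As an illustrative sketch (not the authors' code), the masking probability stated in Theorem 1 can be evaluated directly. The function name `token_mask_prob` and the argument checks are assumptions made here. `beta_i` stands for $\beta_{x^i}$ and `beta_ref` for $\beta_{\text{ref}}$; the excerpt does not define $\beta_{\text{ref}}$ further, so it is treated below as a positive reference rate.

```python
def token_mask_prob(t: float, beta_i: float, beta_ref: float) -> float:
    """Evaluate t_i = 1 - (1 - t) ** (beta_i / beta_ref) from Theorem 1.

    Assumptions (not stated in the excerpt): t lies in [0, 1],
    beta_i is non-negative, and beta_ref is strictly positive.
    """
    if not 0.0 <= t <= 1.0:
        raise ValueError("t must lie in [0, 1]")
    if beta_i < 0.0 or beta_ref <= 0.0:
        raise ValueError("beta_i must be >= 0 and beta_ref must be > 0")
    return 1.0 - (1.0 - t) ** (beta_i / beta_ref)


# With beta_i == beta_ref the formula reduces to the global timestep t.
same_rate = token_mask_prob(0.5, 1.0, 1.0)
# A token whose rate is twice the reference is masked more often.
high_rate = token_mask_prob(0.5, 2.0, 1.0)
# A token whose rate is half the reference is masked less often.
low_rate = token_mask_prob(0.5, 0.5, 1.0)
```

When $\beta_{x^i} = \beta_{\text{ref}}$ the exponent is 1 and $t_i = t$, which matches the uniform masking of the standard SFT pipeline; larger rates push $t_i$ toward 1 and smaller rates toward 0.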

Figures (3)

  • Figure 1: (a) The SFT pipeline. A timestep $t$ is uniformly sampled from $[0,1]$, and each token is masked independently with probability $t$. The training objective is to predict the masked tokens accurately based on the unmasked ones. (b) The GIFT pipeline. In each training step, we perform two forward passes. During the first forward pass, we mask the entire answer and estimate the masking rate $\beta_i$ for each token by computing its predictive entropy. In the second forward pass, the $i$-th token is masked with probability $t_i$ (computed from $\beta_i$), and its training weight is set to $\tfrac{1}{t_i}$. Tokens with higher entropy are more likely to be masked and thus receive stronger training signals.
  • Figure 2: Visualization of high-frequency tokens with different entropy levels. Tokens with higher entropy correspond to those where the model exhibits greater uncertainty, and are learned more frequently during training.
  • Figure 3: Reward curves of models cold-started with GIFT and SFT during subsequent reinforcement learning training. The curves are smoothed using a time-weighted EMA. As shown, the model initialized with GIFT achieves higher rewards.
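
The two-pass procedure described in the Figure 1(b) caption can be sketched as follows. This is a minimal toy under stated assumptions, not the authors' implementation: the model is replaced by hand-written probability lists, the masking rate $\beta_i$ is taken to be the token's predictive entropy itself, $\beta_{\text{ref}}$ is taken to be the mean entropy over the answer, and the loss is averaged over the answer length. None of these three choices is confirmed by the excerpt, and the names `entropy`, `gift_mask`, and `weighted_loss` are hypothetical.

```python
import math
import random


def entropy(probs):
    """Shannon entropy (in nats) of one predictive distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def gift_mask(first_pass_probs, t, rng, eps=1e-8):
    """Turn first-pass predictions into mask probabilities, a mask, and weights.

    first_pass_probs: one distribution per answer token, obtained with the
    whole answer masked (first forward pass in Figure 1(b)).
    """
    # ASSUMPTION: beta_i is the entropy itself; the excerpt only says that
    # beta_i is estimated from the predictive entropy.
    betas = [entropy(p) for p in first_pass_probs]
    # ASSUMPTION: beta_ref is the mean entropy, floored to avoid dividing by 0.
    beta_ref = max(sum(betas) / len(betas), eps)
    # Theorem 1: t_i = 1 - (1 - t) ** (beta_i / beta_ref).
    mask_probs = [1.0 - (1.0 - t) ** (b / beta_ref) for b in betas]
    # Mask each token independently with its own probability t_i.
    masked = [rng.random() < q for q in mask_probs]
    # Masked tokens get training weight 1 / t_i; unmasked tokens get none.
    weights = [1.0 / q if m else 0.0 for q, m in zip(mask_probs, masked)]
    return mask_probs, masked, weights


def weighted_loss(target_logprobs, weights):
    """Weighted negative log-likelihood of the targets (second forward pass).

    ASSUMPTION: normalised by answer length; the excerpt gives no normaliser.
    """
    return sum(-lp * w for lp, w in zip(target_logprobs, weights)) / len(weights)


# Toy first-pass predictions: uncertain, confident, and in-between tokens.
first_pass = [
    [0.25, 0.25, 0.25, 0.25],
    [0.97, 0.01, 0.01, 0.01],
    [0.50, 0.50, 0.00, 0.00],
]
mask_probs, masked, weights = gift_mask(first_pass, t=0.5, rng=random.Random(0))
```

In this toy, the uniform (highest-entropy) token receives the largest masking probability and the confident token the smallest, which mirrors the caption's statement that higher-entropy tokens are masked more often. In a real training loop, `target_logprobs` would come from the second forward pass on the partially masked sequence.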

Theorems & Definitions (7)

  • Theorem 1
  • Lemma 1
  • proof
  • Lemma 2
  • proof
  • Theorem 1
  • proof