Gradients Must Earn Their Influence: Unifying SFT with Generalized Entropic Objectives

Zecheng Wang, Deyuan Liu, Chunshan Li, Yupeng Zhang, Zhengyun Zhao, Dianhui Chu, Bingning Wang, Dianbo Sui

TL;DR

This work tackles the suboptimal token-wise gradient allocation in supervised fine-tuning by introducing a unified deformed-log objective family that reveals a gate × error structure in learning signals. It derives a state-aware focus trajectory α^*(p) using the Cayley transform, enabling a smooth shift from coverage of uncertain knowledge to sharpening of confident predictions. The authors further instantiate a parameter-free Dynamic Entropy Fine-Tuning (DEFT) by leveraging distribution-level concentration via Rényi-2 entropy, yielding adaptive gating without extra hyperparameters. Empirical results across multiple backbones and tasks show DEFT and Cayley-Trans delivering consistent gains, particularly in strong-prior and weak-prior regimes, and demonstrate improved out-of-domain generalization. The approach offers a principled, information-theoretic path to balance exploration and exploitation in SFT with practical gains for robust model fine-tuning.

Abstract

Standard negative log-likelihood (NLL) for Supervised Fine-Tuning (SFT) applies uniform token-level weighting. This rigidity creates a two-fold failure mode: (i) overemphasizing low-probability targets can amplify gradients on noisy supervision and disrupt robust priors, and (ii) uniform weighting provides weak sharpening when the model is already confident. Existing methods fail to resolve the resulting plasticity--stability dilemma, often suppressing necessary learning signals alongside harmful ones. To address this issue, we unify token-level SFT objectives within a generalized deformed-log family and expose a universal gate $\times$ error gradient structure, where the gate controls how much the model trusts its current prediction. By employing the Cayley transform, we map the model's continuously evolving uncertainty onto a continuous focus trajectory, which enables seamless interpolation between scenarios involving uncertain novel concepts and those involving well-established knowledge. We then introduce Dynamic Entropy Fine-Tuning (DEFT), a parameter-free objective that modulates the trust gate using distribution concentration (Rényi-2 entropy) as a practical proxy for the model's predictive state. Extensive experiments and analyses demonstrate that DEFT achieves a better balance between exploration and exploitation, leading to improved overall performance.
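The gate $\times$ error structure mentioned in the abstract can be checked numerically. The sketch below is not the authors' implementation. It assumes the standard $q$-logarithm with loss index $q = 1-\alpha$ (the parameterization quoted in the Key Result excerpt), under which the gradient of $-\ln_q p_y$ with respect to the logits works out to $p_y^{\alpha}\,(p - e_y)$. The function names are hypothetical. `renyi2_entropy` uses the textbook definition $H_2 = -\log\sum_k p_k^2$; how DEFT maps this quantity to the trust gate is not stated in this summary, so the sketch does not attempt to reproduce it.

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of floats.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def deformed_log(p, alpha):
    # q-logarithm with q = 1 - alpha (assumed parameterization).
    # alpha = 0 recovers the natural log, i.e. standard NLL.
    if abs(alpha) < 1e-12:
        return math.log(p)
    return (p ** alpha - 1.0) / alpha

def deformed_loss(logits, target, alpha):
    # Token-level loss: negative deformed log of the target probability.
    return -deformed_log(softmax(logits)[target], alpha)

def gate_times_error(logits, target, alpha):
    # Analytic gradient w.r.t. logits: gate p_y**alpha times error (p - one_hot).
    p = softmax(logits)
    gate = p[target] ** alpha
    return [gate * (pk - (1.0 if k == target else 0.0)) for k, pk in enumerate(p)]

def numerical_grad(logits, target, alpha, eps=1e-6):
    # Central finite differences, used only to check the analytic form.
    grad = []
    for k in range(len(logits)):
        up = list(logits)
        dn = list(logits)
        up[k] += eps
        dn[k] -= eps
        diff = deformed_loss(up, target, alpha) - deformed_loss(dn, target, alpha)
        grad.append(diff / (2 * eps))
    return grad

def renyi2_entropy(p):
    # Renyi entropy of order 2: -log of the collision probability sum_k p_k^2.
    return -math.log(sum(pk * pk for pk in p))

if __name__ == "__main__":
    logits = [2.0, 0.5, -1.0]
    for alpha in (0.0, 0.5, 1.0):
        print(alpha, gate_times_error(logits, 0, alpha), numerical_grad(logits, 0, alpha))
    print(renyi2_entropy(softmax(logits)))
```

With $\alpha = 0$ the gate equals 1 and the gradient reduces to the usual NLL form $p - e_y$; with $\alpha = 1$ the gate equals $p_y$, so low-probability targets contribute proportionally smaller gradients.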

Paper Structure

This paper contains 34 sections, 19 theorems, 130 equations, 6 figures, 8 tables.

Key Result

Theorem 1

Fix a token context $c$ and let $r(\cdot\mid c)$ be the true next-token distribution. For notational convenience within this theorem only, we write $x\equiv c$ in conditioning terms, and we use $q(\cdot\mid x)\equiv \hat{p}(\cdot\mid x)$ for the predicted distribution. Define the scoring rule $S_\alpha$ [...]. Moreover, for any model $\hat{p}$, [...]. Thus the loss index is $q_{\mathrm{loss}}=1-\alpha$ while the [...]

Figures (6)

  • Figure 1: Learning conflicts of different tokens under NLL during training (token-wise learning and forgetting). The scatter plot shows changes in token probability ($\triangle P$) and loss ($\triangle L$). Bar plots show the proportions of tokens, including overall learning and forgetting, the fractions of forgotten tokens with high and low confidence (Q2 in the scatter plot), and the fractions of learned tokens with high and low confidence (Q4 in the scatter plot). Hatched patterns indicate high-confidence tokens.
  • Figure 2: Token-level gradient distributions across model capability regions. The x-axis shows token probability and the y-axis entropy, with color intensity indicating normalized gradient magnitude, illustrating how different loss functions affect token-wise learning. Top: Model-Strong; middle: Model-Medium; bottom: Model-Weak.
  • Figure 3: Token probability distributions on the training set under different objectives in the Model-Strong regime.
  • Figure 4: Token probability distributions on the training set under different objectives in the Model-Weak regime.
  • Figure 5: Evolution of trust gate $a$ and average probability $p$ during Llama-3.1-8B training with DEFT. $a$ exhibits regime-specific initialization (e.g., 0.65 in Model-Strong vs. 0.05 in Model-Weak), reflecting DEFT's adaptive response to the model's varying levels of prior knowledge. Furthermore, as training progresses, $a$ increases steadily, facilitating a smooth transition from a coverage-oriented phase to a sharpening-oriented phase.
  • ...and 1 more figure

Theorems & Definitions (42)

  • Definition 1: $q$-logarithm
  • Definition 2: Tsallis entropy
  • Theorem 1: Optimization--entropy duality
  • Lemma 1: Softmax Jacobian
  • proof
  • Lemma 2: General Objective Gradient
  • proof
  • Proposition 1: Continuity and limit properties of $\ln_q$
  • proof
  • Theorem 2: Optimization-Entropy Duality, restating Theorem 1
  • ...and 32 more
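
For orientation, the first few items in this list have standard textbook forms, collected below. This is a sketch under the usual conventions, not a transcription of the paper's statements: the paper's own parameterization (for example, how its $\alpha$ relates to $q$) may differ, and the last line is a derivation from the softmax Jacobian rather than a quoted result.

```latex
% q-logarithm (standard form); the limit q -> 1 recovers the natural log
\ln_q(x) = \frac{x^{1-q} - 1}{1 - q}, \quad q \neq 1, \qquad \lim_{q \to 1} \ln_q(x) = \ln x

% Tsallis entropy of a distribution p (standard form)
S_q(p) = \sum_i p_i \ln_q\!\left(\frac{1}{p_i}\right) = \frac{1 - \sum_i p_i^{\,q}}{q - 1}

% Softmax Jacobian for p = softmax(z)
\frac{\partial p_i}{\partial z_j} = p_i\left(\delta_{ij} - p_j\right)

% Gradient of the deformed-log loss on target y, via the chain rule
\frac{\partial}{\partial z_k}\bigl[-\ln_q p_y\bigr]
  = -\,p_y^{-q}\,\frac{\partial p_y}{\partial z_k}
  = -\,p_y^{-q}\,p_y\left(\delta_{ky} - p_k\right)
  = \underbrace{p_y^{\,1-q}}_{\text{gate}}\;\underbrace{\left(p_k - \delta_{ky}\right)}_{\text{error}}
```

Setting $q = 1$ gives a gate of 1, which is the ordinary NLL gradient; smaller $q$ scales the same error term by a power of the target probability.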