Gradients Must Earn Their Influence: Unifying SFT with Generalized Entropic Objectives

Zecheng Wang; Deyuan Liu; Chunshan Li; Yupeng Zhang; Zhengyun Zhao; Dianhui Chu; Bingning Wang; Dianbo Sui

TL;DR

This work tackles the suboptimal token-wise gradient allocation in supervised fine-tuning by introducing a unified deformed-log objective family that reveals a gate × error structure in learning signals. It derives a state-aware focus trajectory α^*(p) using the Cayley transform, enabling a smooth shift from coverage of uncertain knowledge to sharpening of confident predictions. The authors further instantiate a parameter-free Dynamic Entropy Fine-Tuning (DEFT) by leveraging distribution-level concentration via Rényi-2 entropy, yielding adaptive gating without extra hyperparameters. Empirical results across multiple backbones and tasks show DEFT and Cayley-Trans delivering consistent gains, particularly in strong-prior and weak-prior regimes, and demonstrate improved out-of-domain generalization. The approach offers a principled, information-theoretic path to balance exploration and exploitation in SFT with practical gains for robust model fine-tuning.
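The gate × error structure can be made concrete with a minimal NumPy sketch. It assumes the deformed-log loss takes the standard Tsallis form $-\ln_q(p)$ with $q = 1-\alpha$ (the loss index stated in Theorem 1), so that the gradient with respect to the logits factors as $p^{\alpha}(\pi - e_y)$. The function names are illustrative and are not taken from the paper.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D logit vector."""
    z = z - np.max(z)
    e = np.exp(z)
    return e / e.sum()

def deformed_log_loss(z, y, alpha):
    """Loss -ln_q(p_y) with q = 1 - alpha (assumed form); alpha = 0 is plain NLL."""
    p = softmax(z)[y]
    if abs(alpha) < 1e-12:
        return -np.log(p)
    return (1.0 - p ** alpha) / alpha

def gate_times_error(z, y, alpha):
    """Analytic logit gradient written as gate * error."""
    pi = softmax(z)
    gate = pi[y] ** alpha      # scalar gate: 1 for NLL, p_y for alpha = 1
    error = pi.copy()
    error[y] -= 1.0            # usual softmax error term (pi - one_hot(y))
    return gate * error

def numerical_grad(z, y, alpha, eps=1e-6):
    """Central finite differences, used only to check the factorization."""
    g = np.zeros_like(z)
    for i in range(len(z)):
        dz = np.zeros_like(z)
        dz[i] = eps
        g[i] = (deformed_log_loss(z + dz, y, alpha)
                - deformed_log_loss(z - dz, y, alpha)) / (2 * eps)
    return g

z = np.array([2.0, 0.5, -1.0])
for alpha in (0.0, 0.5, 1.0):
    analytic = gate_times_error(z, 1, alpha)
    assert np.allclose(analytic, numerical_grad(z, 1, alpha), atol=1e-6)
```

Under this assumed form, $\alpha = 0$ leaves the NLL gradient unchanged, while larger $\alpha$ shrinks the gradient on tokens the model currently assigns low probability.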

Abstract

Standard negative log-likelihood (NLL) for Supervised Fine-Tuning (SFT) applies uniform token-level weighting. This rigidity creates a two-fold failure mode: (i) overemphasizing low-probability targets can amplify gradients on noisy supervision and disrupt robust priors, and (ii) uniform weighting provides weak sharpening when the model is already confident. Existing methods fail to resolve the resulting plasticity-stability dilemma, often suppressing necessary learning signals alongside harmful ones. To address this issue, we unify token-level SFT objectives within a generalized deformed-log family and expose a universal gate $\times$ error gradient structure, where the gate controls how much the model trusts its current prediction. By employing the Cayley transform, we map the model's continuously evolving uncertainty onto a continuous focus trajectory, which enables seamless interpolation between scenarios involving uncertain novel concepts and those involving well-established knowledge. We then introduce Dynamic Entropy Fine-Tuning (DEFT), a parameter-free objective that modulates the trust gate using distribution concentration (Rényi-2 entropy) as a practical proxy for the model's predictive state. Extensive experiments and analyses demonstrate that DEFT achieves a better balance between exploration and exploitation, leading to improved overall performance.
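As a rough illustration of how distribution concentration could drive the trust gate, the sketch below computes the Rényi-2 entropy $H_2(\pi) = -\ln \sum_i \pi_i^2$ and the corresponding collision probability $\sum_i \pi_i^2 = e^{-H_2}$. Using the collision probability directly as the focus parameter is an assumption made here for illustration only; the abstract does not specify DEFT's exact mapping, and the helper names are hypothetical.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D logit vector."""
    z = z - np.max(z)
    e = np.exp(z)
    return e / e.sum()

def renyi2_entropy(pi):
    """Renyi entropy of order 2: H_2 = -log(sum_i pi_i^2)."""
    return float(-np.log(np.sum(pi ** 2)))

def concentration(pi):
    """Collision probability sum_i pi_i^2 = exp(-H_2), which lies in [1/V, 1]."""
    return float(np.sum(pi ** 2))

def entropy_gated_gradient(z, y):
    """Hypothetical gate: alpha is set to the concentration (an assumption)."""
    pi = softmax(z)
    alpha = concentration(pi)      # ASSUMPTION: not confirmed by the source text
    gate = pi[y] ** alpha          # treated as a constant weight on the error
    error = pi.copy()
    error[y] -= 1.0
    return alpha, gate, gate * error

flat = np.zeros(4)                        # uniform prediction over 4 tokens
peaked = np.array([10.0, 0.0, 0.0, 0.0])  # confident prediction of token 0

alpha_flat, gate_flat, _ = entropy_gated_gradient(flat, 1)
alpha_peak, gate_peak, _ = entropy_gated_gradient(peaked, 1)
print(alpha_flat, gate_flat)
print(alpha_peak, gate_peak)
```

In this toy setup, a flat prediction yields a small focus parameter and a gate near the NLL regime, whereas a confident prediction that disagrees with the target yields a focus parameter near 1 and a strongly reduced gate.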

Paper Structure (34 sections, 19 theorems, 130 equations, 6 figures, 8 tables)

Introduction
Related Works
Token-level modeling and filtering.
Modifying SFT losses via probability/uncertainty-aware reweighting.
Learning Signal Allocation in SFT
Token-Level Objectives and Gradient Structure
The Dual Nature of Low Probability: Coverage vs. Sharpening
Trust Gating and Entropy Duality
Trust Gating via the Deformed-Log Family
The Optimization-Entropy Duality
Unifying Coverage and Sharpening
Geometric Anchors and the Cayley Transform
Normalized Surprisal to Distribution-Level Surprisal.
Main Experiments
Experimental Setup
...and 19 more sections

Key Result

Theorem 1

Fix a token context $c$ and let $r(\cdot\mid c)$ be the true next-token distribution. For notational convenience within this theorem only, we write $x\equiv c$ in conditioning terms, and we use $q(\cdot\mid x)\equiv \hat{p}(\cdot\mid x)$ for the predicted distribution. Define the scoring rule $S_\alpha$ [...] Moreover, for any model $\hat{p}$, [...] Thus the loss index is $q_{\mathrm{loss}}=1-\alpha$ while the i[...]

Figures (6)

Figure 1: Learning conflicts of different tokens under NLL during training (token-wise learning and forgetting). The scatter plot shows token probability ($\triangle P$) and loss ($\triangle L$) changes. Bar plots show the proportions of tokens, including overall learning and forgetting, the fractions of forgotten tokens with high and low confidence (Q2 in the scatter plot), and fractions of learning tokens with high and low confidence (Q4 in the scatter plot). Hatched patterns indicate high-confidence tokens.
Figure 2: Token-level gradient distributions across model capability regions. The x-axis shows token probability and the y-axis entropy, with color intensity indicating normalized gradient magnitude, illustrating how different loss functions affect token-wise learning. Top: Model-Strong; middle: Model-Medium; bottom: Model-Weak.
Figure 3: Token probability distributions on the training set under different objectives in the Model-Strong regime.
Figure 4: Token probability distributions on the training set under different objectives in the Model-Weak regime.
Figure 5: Evolution of trust gate $a$ and average probability $p$ during Llama-3.1-8B training with DEFT. $a$ exhibits regime-specific initialization (e.g., 0.65 in Model-Strong vs. 0.05 in Model-Weak), reflecting DEFT's adaptive response to the model's varying levels of prior knowledge. Furthermore, as training progresses, $a$ increases steadily, facilitating a smooth transition from a coverage-oriented phase to a sharpening-oriented phase.
...and 1 more figures

Theorems & Definitions (42)

Definition 1: $q$-logarithm
Definition 2: Tsallis entropy
Theorem 1: Optimization-entropy duality
Lemma 1: Softmax Jacobian
proof
Lemma 2: General Objective Gradient
proof
Proposition 1: Continuity and limit properties of $\ln_q$
proof
Theorem 2: Optimization-entropy duality (restating Theorem 1)
...and 32 more
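For reference, the textbook forms of the two definitions listed above are given below. These are the standard Tsallis conventions and are stated here as an assumption; the paper's own notation may differ, for example in how the index $q$ relates to the focus parameter $\alpha$.

```latex
% q-logarithm (standard form), with the natural log recovered as q -> 1
\ln_q(x) = \frac{x^{1-q} - 1}{1 - q} \quad (q \neq 1),
\qquad \lim_{q \to 1} \ln_q(x) = \ln x

% Tsallis entropy of a discrete distribution p
S_q(p) = \frac{1 - \sum_i p_i^{q}}{q - 1} = -\sum_i p_i^{q} \, \ln_q(p_i)

% Renyi-2 entropy and its relation to the Tsallis entropy at q = 2
H_2(p) = -\ln \sum_i p_i^{2},
\qquad S_2(p) = 1 - \sum_i p_i^{2}
\;\Longrightarrow\; H_2(p) = -\ln\bigl(1 - S_2(p)\bigr)
```

Both $H_2$ and $S_2$ are monotone functions of the collision probability $\sum_i p_i^2$, so either can serve as a measure of how concentrated a predictive distribution is.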
