Beyond Log Likelihood: Probability-Based Objectives for Supervised Fine-Tuning across the Model Capability Continuum

Gaotang Li, Ruizhong Qiu, Xiusi Chen, Heng Ji, Hanghang Tong

TL;DR

This paper questions the universality of negative log likelihood (NLL) as the objective for supervised fine-tuning (SFT) in post-training large language models. It introduces a general probability-based objective family $f(p)$ that includes NLL as a limit and analyzes how objective shape interacts with the model-capability continuum, distinguishing model-strong and model-weak regimes. Through extensive experiments across 7 backbones, 14 benchmarks, and 3 domains, plus theoretical gradient-flow analysis, it shows that prior-leaning objectives like $-p$ excel when priors are strong, while NLL dominates under weak priors, with an intermediate region where no single objective is best. The work proposes adaptive objective strategies that align learning signals with model priors and task priors, offering a principled path to improve SFT generalization and prompting future exploration of curriculum-style objective adaptation in post-training.

Abstract

Supervised fine-tuning (SFT) is the standard approach for post-training large language models (LLMs), yet it often shows limited generalization. We trace this limitation to its default training objective: negative log likelihood (NLL). While NLL is classically optimal when training from scratch, post-training operates in a different regime that can violate its optimality assumptions: models already encode task-relevant priors, and supervision can be long and noisy. To this end, we study a general family of probability-based objectives and characterize their effectiveness under different conditions. Through comprehensive experiments and extensive ablation studies across 7 model backbones, 14 benchmarks, and 3 domains, we uncover a critical dimension that governs objective behavior: the model-capability continuum. Near the model-strong end, prior-leaning objectives that downweight low-probability tokens (e.g., $-p$, $-p^{10}$, thresholded variants) consistently outperform NLL; toward the model-weak end, NLL dominates; in between, no single objective prevails. Our theoretical analysis further elucidates how objectives trade places across the continuum, providing a principled foundation for adapting objectives to model capability. Our code is available at https://github.com/GaotangLi/Beyond-Log-Likelihood.
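
As a rough illustration of the objectives named in the abstract, the sketch below evaluates $-\log p$, $-p$, and $-p^{10}$ on the probability a model assigns to each target token. It assumes a plain per-token average; the helper names (`target_probs`, `OBJECTIVES`, `sft_loss`) are hypothetical and are not taken from the authors' repository.

```python
import numpy as np

def target_probs(logits, targets):
    """Softmax probability assigned to each target token."""
    shifted = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    probs = np.exp(shifted) / np.exp(shifted).sum(axis=-1, keepdims=True)
    return probs[np.arange(len(targets)), targets]

# Illustrative members of the probability-based family f(p) (names are assumptions).
OBJECTIVES = {
    "nll": lambda p: -np.log(p),      # standard SFT objective
    "neg_p": lambda p: -p,            # prior-leaning
    "neg_p10": lambda p: -p ** 10,    # more strongly prior-leaning
}

def sft_loss(logits, targets, name="nll"):
    """Average f(p) over tokens (token-level averaging is an assumption)."""
    p = target_probs(logits, targets)
    return OBJECTIVES[name](p).mean()

# Toy example: one confident token and one uniform (uncertain) token.
logits = np.array([[2.0, 0.0, 0.0], [0.0, 0.0, 0.0]])
targets = np.array([0, 1])
for name in OBJECTIVES:
    print(name, sft_loss(logits, targets, name))
```

In this toy example the uncertain token (probability 1/3) dominates the NLL value but contributes almost nothing under $-p^{10}$, which is the downweighting of low-probability tokens that the abstract describes.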

Paper Structure

This paper contains 30 sections, 10 theorems, 58 equations, 4 figures, 7 tables.

Key Result

Lemma 1

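Let $f:[0,1]\!\to\!\mathbb{R}$ be differentiable and nonincreasing. Then the gradient of Eq. \ref{eq:general_obj} with respect to the logits $z_t$ at step $t$ is, for each token-level term, $\partial f(p_{t,y})/\partial z_{t,i} = f'(p_{t,y})\,p_{t,y}\,(\delta_{iy} - p_{t,i})$, where $p_t = \mathrm{softmax}(z_t)$ and $\delta_{iy}$ is the Kronecker delta. In particular, for the correct class $i=y$, $\partial f(p_{t,y})/\partial z_{t,y} = f'(p_{t,y})\,p_{t,y}\,(1 - p_{t,y}) \le 0$.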
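
The logit gradient in Lemma 1 can be sanity-checked numerically. The sketch below compares the chain-rule expression for a softmax output, $f'(p_y)\,p_y\,(\mathbb{1}[i=y] - p_i)$, against central finite differences for three objectives. The function names and the toy logits are illustrative assumptions, and the sign convention is simply that of differentiating $f(p_y)$ directly, which may differ from the notation used in the paper.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def analytic_grad(z, y, df):
    """Chain rule: d f(p_y) / d z_i = f'(p_y) * p_y * (1[i == y] - p_i)."""
    p = softmax(z)
    onehot = np.zeros_like(z)
    onehot[y] = 1.0
    return df(p[y]) * p[y] * (onehot - p)

def numeric_grad(z, y, f, eps=1e-6):
    """Central finite differences of f(softmax(z)[y]) with respect to z."""
    grad = np.zeros_like(z)
    for i in range(len(z)):
        step = np.zeros_like(z)
        step[i] = eps
        grad[i] = (f(softmax(z + step)[y]) - f(softmax(z - step)[y])) / (2 * eps)
    return grad

# Each entry pairs an objective f with its derivative f'.
cases = {
    "-log(p)": (lambda p: -np.log(p), lambda p: -1.0 / p),
    "-p": (lambda p: -p, lambda p: -1.0),
    "-p^10": (lambda p: -p ** 10, lambda p: -10.0 * p ** 9),
}

z = np.array([1.0, -0.5, 0.3, 2.0])  # toy logits
y = 2                                # index of the correct class
for name, (f, df) in cases.items():
    gap = np.max(np.abs(analytic_grad(z, y, df) - numeric_grad(z, y, f)))
    print(f"{name}: max abs difference = {gap:.2e}")
```

For $f(p) = -\log p$ the expression reduces to the familiar $p_i - \mathbb{1}[i=y]$, while for $-p$ and $-p^{10}$ the extra factor of $p_y$ (or $p_y^{10}$) shrinks the update on tokens the model already considers unlikely.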

Figures (4)

  • Figure 1: The model-capability continuum of SFT objectives in post-training. At the model-strong (MS) end, where base models already encode extensive priors (e.g., Llama 3 reports 25% math pretraining tokens; Grattafiori et al., 2024), prior-leaning objectives that downweight low-probability tokens (e.g., $-p$, $-p^{10}$, or thresholded variants) consistently outperform NLL by up to 16%. At the model-weak (MW) end, where no useful priors exist (e.g., no figfont puzzles in pretraining data), the standard NLL dominates. In the model-intermediate (MI) region (e.g., medical reasoning, where models rely on partial world knowledge), the gap between objectives narrows and no single choice consistently prevails. This continuum highlights how the effectiveness of an SFT objective depends critically on the capability of the base model.
  • Figure 2: The logit gradients $W_f(p)$ of different functions.
  • Figure 3: Performance under quantile thresholding for $-\log(p)$, $-p$, and $\log(1-p)$. Let $Q_{\text{percentile}}$ denote the predicted probability at the specified percentile of the training set. ($\geq$ Percentile) corresponds to $I = [Q_{\text{percentile}}, 1]$ in Eq. \ref{eq:hard_threshold}, while ($\leq$ Percentile) corresponds to $I = [0, Q_{\text{percentile}}]$. Key findings: (1) low-probability tokens consistently harm performance across all objectives; (2) when training on all tokens, objectives that de-emphasize low-probability tokens ($-p$ and $\log(1-p)$) outperform $-\log(p)$; (3) restricting training to only the top 10% of tokens yields the strongest improvements across all objectives, surpassing standard SFT.
  • Figure 4: Analysis of MS and MW ends in terms of objective convexity (with Eq. \ref{eq:example_alpha}) and likelihood estimation. In MS, more concave (prior-leaning) objectives yield better downstream accuracy, while in MW, more convex (prior-averse) objectives dominate. The likelihood estimation results align with these trends, suggesting that objective shape directly interacts with model prior strength.
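
The quantile thresholding described in the Figure 3 caption can be sketched as a mask over token probabilities. This is only an illustration under stated assumptions: the quantile is computed from the array passed in (the caption computes it over the training set), tokens outside $I$ are assumed to contribute nothing, the loss is averaged over the kept tokens, and the names `quantile_mask` and `thresholded_loss` are hypothetical.

```python
import numpy as np

def quantile_mask(p, percentile, keep="geq"):
    """Boolean mask for I = [Q, 1] (keep="geq") or I = [0, Q] (keep="leq")."""
    q = np.percentile(p, percentile)  # Q_percentile, here from the given tokens
    return p >= q if keep == "geq" else p <= q

def thresholded_loss(p, f, percentile, keep="geq"):
    """Mean of f(p) over tokens inside I; tokens outside I are ignored."""
    mask = quantile_mask(p, percentile, keep)
    kept = p[mask]
    return f(kept).mean() if kept.size else 0.0

# Toy per-token target probabilities.
p = np.array([0.05, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9, 0.95, 0.99])

top_mask = quantile_mask(p, 90, keep="geq")     # tokens at or above the 90th percentile
bottom_mask = quantile_mask(p, 10, keep="leq")  # tokens at or below the 10th percentile
print("kept (>= 90th percentile):", p[top_mask])
print("kept (<= 10th percentile):", p[bottom_mask])
print("thresholded -p loss:", thresholded_loss(p, lambda x: -x, 90))
```

In a training loop the same mask would typically be applied to the per-token loss tensor before reduction, so that gradients flow only through tokens whose probability falls inside $I$.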

Theorems & Definitions (20)

  • Lemma 1: Gradient Shape
  • Proposition 1: Convex versus Concave Objectives
  • Definition 1: Prior-leaning versus Prior-averse Objectives
  • Remark 1
  • Theorem 1: Characterization via Gradient Flow, Informal
  • Remark 2
  • Lemma 2: Gradient Shape
  • proof
  • Proposition 2: Convex versus Concave Objectives
  • proof
  • ...and 10 more