Matching Features, Not Tokens: Energy-Based Fine-Tuning of Language Models

Samy Jelassi, Mujin Kwun, Rosie Zhao, Yuanzhi Li, Nicolo Fusi, Yilun Du, Sham M. Kakade, Carles Domingo-Enrich

Abstract

Cross-entropy (CE) training provides dense and scalable supervision for language models, but it optimizes next-token prediction under teacher forcing rather than sequence-level behavior under model rollouts. We introduce a feature-matching objective for language-model fine-tuning that targets sequence-level statistics of the completion distribution, providing dense semantic feedback without requiring a task-specific verifier or preference model. To optimize this objective efficiently, we propose energy-based fine-tuning (EBFT), which uses strided block-parallel sampling to generate multiple rollouts from nested prefixes concurrently, batches feature extraction over these rollouts, and uses the resulting embeddings to perform an on-policy policy-gradient update. We present a theoretical perspective connecting EBFT to KL-regularized feature-matching and energy-based modeling. Empirically, across Q&A coding, unstructured coding, and translation, EBFT matches RLVR and outperforms SFT on downstream accuracy while achieving a lower validation cross-entropy than both methods.

Paper Structure

This paper contains 54 sections, 6 theorems, 83 equations, 21 figures, 6 tables, 1 algorithm.

Key Result

Lemma 1.1

Consider the assumptions [...]. Then, the optimal conditional feature-matching loss profile $\mathbb{E}_{c \sim p}[ \mathrm{Var}[\phi_c(y)|c] ]$ is non-decreasing with $G$ and admits the bound [...]

Figures (21)

  • Figure 1: Feature-matching loss grows with completion length. Conditional feature-matching loss (lower is better) as a function of completion length for Qwen2.5-1.5B fine-tuned with SFT on OpenCodeInstruct (Ahmad et al., 2025). Although this increase is expected even under a perfect model due to growing feature variance, part of the degradation reflects SFT's inability to calibrate the model's rollout distribution over long horizons.
  • Figure 2: EBFT achieves the lowest feature-matching loss across all completion lengths. Despite training with a rollout horizon of only 8 tokens, EBFT's gains persist and grow at longer completions. RLVR worsens this loss relative to the base model.
  • Figure 3: EBFT improves downstream performance without sacrificing distributional calibration. From left to right, we plot HumanEval accuracy (greedy and pass@16), validation cross-entropy (CE), and conditional feature-matching (CFM) loss over training for Qwen2.5-1.5B fine-tuned on OpenCodeInstruct (Ahmad et al., 2025). SFT improves cross-entropy and CFM loss but lags on downstream accuracy. RLVR improves downstream accuracy but substantially degrades both calibration metrics relative to the base model (dashed line). EBFT achieves the best results across all four metrics, avoiding this tradeoff. CE and CFM losses are computed on a held-out subset of 1k samples from OpenCodeInstruct.
  • Figure 4: Overview of Energy-Based Fine-Tuning (EBFT). For each context $c$, the generator $p_\theta$ samples $n$ completions. A frozen feature network $\phi$ embeds each prompt-completion pair, producing features $\phi(c\!:\!\hat{y}_j)$ for the sampled completions and $\phi(c\!:\!y)$ for the ground truth. Each completion receives a feature-matching reward measuring alignment with the ground-truth feature moment, and the generator is updated via REINFORCE with an RLOO baseline.
  • Figure 5: On translation, EBFT outperforms both SFT and RLVR on downstream accuracy, cross-entropy, and feature-matching loss. From left to right, we plot COMET scores on WMT22 and MTNT, validation cross-entropy, and CFM loss over training for Llama-3.2-1B fine-tuned on ALMA (Xu et al., 2023). EBFT achieves the lowest CE and CFM losses and matches SFT on WMT22 while clearly outperforming it on MTNT. RLVR underperforms SFT on all four metrics, with cross-entropy rising well above the base model (dashed line).
  • ...and 16 more figures
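
The update described in the Figure 4 caption (feature-matching rewards followed by REINFORCE with an RLOO baseline) can be illustrated with a minimal sketch. This is not the paper's implementation: the reward used here, an inner product between each sampled completion's feature and the ground-truth feature, is an assumption chosen only for illustration, and the function names `feature_matching_rewards`, `rloo_advantages`, and `reinforce_surrogate` are hypothetical. Features are plain Python lists standing in for the outputs of the frozen feature network $\phi$.

```python
from typing import List, Sequence


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def feature_matching_rewards(
    sample_feats: Sequence[Sequence[float]], target_feat: Sequence[float]
) -> List[float]:
    # ASSUMPTION: reward = <phi(c:y_hat_j), phi(c:y)>. The paper's exact
    # reward is not given in this summary; this is an illustrative stand-in.
    return [dot(f, target_feat) for f in sample_feats]


def rloo_advantages(rewards: Sequence[float]) -> List[float]:
    """Leave-one-out baseline: subtract from each reward the mean of the
    rewards of the other n - 1 samples drawn for the same context."""
    n = len(rewards)
    if n < 2:
        raise ValueError("RLOO needs at least two samples per context")
    total = sum(rewards)
    return [r - (total - r) / (n - 1) for r in rewards]


def reinforce_surrogate(
    logprobs: Sequence[float], advantages: Sequence[float]
) -> float:
    """Surrogate loss whose gradient w.r.t. the log-probabilities is the
    negative REINFORCE gradient estimate (advantages held constant)."""
    n = len(logprobs)
    return -sum(a * lp for a, lp in zip(advantages, logprobs)) / n


# Toy example: n = 3 sampled completions with 2-dimensional features.
target = [1.0, 0.0]                             # stands in for phi(c:y)
samples = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]  # stand in for phi(c:y_hat_j)
logprobs = [-1.0, -2.0, -3.0]                   # log p_theta(y_hat_j | c)

rewards = feature_matching_rewards(samples, target)
advantages = rloo_advantages(rewards)
loss = reinforce_surrogate(logprobs, advantages)
```

Because each baseline excludes the sample it is applied to, the advantages for one context always sum to zero, so completions are only ever rewarded relative to their siblings from the same prompt.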

Theorems & Definitions (9)

  • Lemma 1.1: The optimal conditional feature-matching profile
  • proof
  • Lemma 2.1: Variational representation of the chi-squared divergence
  • proof
  • Lemma 2.2
  • Theorem 4.1 (Thm. 2 of Domingo-Enrich et al., 2022)
  • Theorem 4.2 (Prop. 3 of Domingo-Enrich et al., 2022)
  • Theorem 4.3
  • proof