VCORE: Variance-Controlled Optimization-based Reweighting for Chain-of-Thought Supervision

Xuan Gong, Senmiao Wang, Hanbo Huang, Ruoyu Sun, Shiyu Liang

TL;DR

VCORE tackles the inefficiency of uniform token supervision in long chain-of-thought SFT by reframing token weighting as a constrained optimization problem over token positions. It derives a closed-form Gibbs distribution $q^*(t|x,y,\theta) \propto \exp(\tau s_t(x,y,\theta))$, where $s_t$ is the gradient-utility of token $t$, and estimates all $s_t$ with a one-backward probing trick. To stabilize training, VCORE introduces a variance-control factor $\alpha = \sqrt{\mathcal{V}_u/\mathcal{V}_q}$ that aligns the update variance with that of uniform weighting. Empirically, VCORE outperforms DFT and iw-SFT on math and code benchmarks across multiple model scales and provides a stronger initialization for RL fine-tuning, enabling more robust reasoning generalization. This work offers a principled, scalable alternative to heuristic token weighting in long CoT SFT and highlights the benefits of integrating SGD dynamics and variance control into supervision strategies.
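The two formulas in the TL;DR can be illustrated with a small sketch. This is not the authors' implementation: the utility scores below are made-up numbers, the helper names `gibbs_weights` and `variance_control_factor` are hypothetical, and the variances $\mathcal{V}_u$ and $\mathcal{V}_q$ are passed in as given values because this summary does not say how they are estimated. The temperature $\tau = 5 \times 10^{3}$ is taken from the Figure 3 caption.

```python
import math

def gibbs_weights(scores, tau):
    """Return q_t proportional to exp(tau * s_t) over token positions.

    Subtracting the maximum before exponentiating keeps the softmax
    numerically stable without changing the result.
    """
    scaled = [tau * s for s in scores]
    peak = max(scaled)
    exps = [math.exp(v - peak) for v in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def variance_control_factor(var_uniform, var_reweighted):
    """Return alpha = sqrt(V_u / V_q).

    Both variances are assumed to be supplied by the caller; how they
    are measured is outside the scope of this sketch.
    """
    return math.sqrt(var_uniform / var_reweighted)

# Hypothetical per-token gradient utilities s_t for a 4-token trajectory.
scores = [0.0, 1e-4, 2e-4, -1e-4]
tau = 5e3

q = gibbs_weights(scores, tau)            # higher utility gets more weight
q_flat = gibbs_weights(scores, 0.0)       # tau = 0 recovers uniform weighting
alpha = variance_control_factor(var_uniform=1.0, var_reweighted=4.0)
```

With $\tau = 0$ every token receives weight $1/T$, which is the uniform cross-entropy case; increasing $\tau$ concentrates supervision on high-utility tokens. In this toy setting the reweighted update is assumed to have four times the variance of the uniform one, so $\alpha = 0.5$ scales it back down.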

Abstract

Supervised fine-tuning (SFT) on long chain-of-thought (CoT) trajectories has emerged as a crucial technique for enhancing the reasoning abilities of large language models (LLMs). However, the standard cross-entropy loss treats all tokens equally, ignoring their heterogeneous contributions across a reasoning trajectory. This uniform treatment leads to misallocated supervision and weak generalization, especially in complex, long-form reasoning tasks. To address this, we introduce Variance-Controlled Optimization-based REweighting (VCORE), a principled framework that reformulates CoT supervision as a constrained optimization problem. By adopting an optimization-theoretic perspective, VCORE enables a principled and adaptive allocation of supervision across tokens, thereby aligning the training objective more closely with the goal of robust reasoning generalization. Empirical evaluations demonstrate that VCORE consistently outperforms existing token reweighting methods. Across both in-domain and out-of-domain settings, VCORE achieves substantial performance gains on mathematical and coding benchmarks, using models from the Qwen3 series (4B, 8B, 32B) and LLaMA-3.1-8B-Instruct. Moreover, we show that VCORE serves as a more effective initialization for subsequent reinforcement learning, establishing a stronger foundation for advancing the reasoning capabilities of LLMs. The code will be released at https://github.com/coder-gx/VCORE.

Paper Structure

This paper contains 32 sections, 8 equations, 4 figures, 8 tables, 1 algorithm.

Figures (4)

  • Figure 1: Overview of VCORE. Compared to the standard cross-entropy loss, VCORE approaches long-CoT SFT from an optimization perspective and adjusts token weights according to their gradient utility, thereby enabling more effective use of supervision signals and improving generalization.
  • Figure 2: Component Analysis and Ablation. (a) Impact of supervised set size on in-domain (Olympiad) and out-of-domain (SGPQA-1k) accuracy for VCORE and DFT; (b) Hyperparameters: reweighting temperature $\tau$ and probing scale $\epsilon$. All results use Qwen3-4B. Metrics are accuracy (%) on Olympiad (in-domain) and SGPQA-1k (out-of-domain).
  • Figure 3: Loss Scaling. Loss curves of Qwen3-4B trained on the math domain with and without loss scaling ($\epsilon = 10^{-4}$, $\tau = 5 \times 10^{3}$).
  • Figure 4: Construction of GPQA-1k.