SPINE: Token-Selective Test-Time Reinforcement Learning with Entropy-Band Regularization

Jianghao Wu, Yasmeen George, Jin Ye, Yicheng Wu, Daniel F. Schmidt, Jianfei Cai

TL;DR

SPINE tackles test-time distribution shift in autoregressive reasoning models by learning selectively at decision points rather than updating all tokens. It identifies fork points via token entropy and applies an entropy-band regularizer to maintain exploration without overfitting to noisy pseudo-rewards, all within a GRPO-based, label-free framework. Across ten benchmarks spanning multimodal VQA, mathematical reasoning, and medical/general QA, SPINE consistently outperforms standard TTRL and no-adaptation baselines, while avoiding response-length collapse and entropy drift. The approach demonstrates robust generalization, stable training dynamics, and practical effectiveness for test-time reasoning in both LLMs and MLLMs. SPINE offers a simple, efficient mechanism for reliable, label-free test-time adaptation in complex reasoning tasks.

Abstract

Large language models (LLMs) and multimodal LLMs (MLLMs) excel at chain-of-thought reasoning but face distribution shift at test time and a lack of verifiable supervision. Recent test-time reinforcement learning (TTRL) methods derive label-free pseudo-rewards from self-consistency voting over sampled trajectories, yet they often collapse: the majority-vote reward prevails, responses shorten, and Pass@1 declines. We trace this to uniform sequence updates in which most tokens are low-entropy followers, while a small high-entropy subset determines the reasoning branches. Thus, we propose SPINE, a token-selective test-time reinforcement learning framework that (i) updates only forking tokens, the high-entropy branch points identified from forward-pass statistics, and (ii) applies an entropy-band regularizer at those tokens to sustain exploration when entropy is too low and to suppress noisy supervision when it is too high. SPINE plugs into GRPO-style objectives, optionally with a KL anchor, and requires no labels or reward models. Across ten benchmarks spanning multimodal VQA, general and expert QA, mathematical reasoning, and medical QA, SPINE consistently improves Pass@1 over TTRL while avoiding response-length collapse and yielding more stable training dynamics on both LLM and MLLM backbones. These results indicate that aligning updates with chain-of-thought branch points is a simple and label-free mechanism for stable and effective test-time adaptation in reasoning models. Code is available at https://github.com/JianghaoWu/SPINE.
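The two components named in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names, the default 20% forking ratio (taken from the Figure 1 caption), and the hinge form of the band penalty are assumptions made here for illustration, and the band edges `h_low` and `h_high` are hypothetical parameters that the paper sets from entropy quantiles.

```python
import numpy as np

def token_entropy(logits):
    """Shannon entropy (in nats) of the softmax distribution at each position.
    logits: array of shape (seq_len, vocab_size)."""
    z = logits - logits.max(axis=-1, keepdims=True)  # stabilise the softmax
    log_p = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    return -(np.exp(log_p) * log_p).sum(axis=-1)

def forking_mask(entropy, ratio=0.2):
    """Boolean mask over tokens: True for the top `ratio` fraction by entropy."""
    k = max(1, int(round(ratio * entropy.size)))
    threshold = np.sort(entropy)[-k]  # k-th largest entropy value
    return entropy >= threshold

def entropy_band_penalty(entropy, mask, h_low, h_high):
    """Assumed hinge penalty: zero inside [h_low, h_high], linear outside,
    averaged over the forking tokens only."""
    h = entropy[mask]
    below = np.clip(h_low - h, 0.0, None)   # entropy too low: exploration lost
    above = np.clip(h - h_high, 0.0, None)  # entropy too high: noisy signal
    return float((below + above).mean())
```

In a training loop, the mask would gate which token positions contribute to the policy-gradient loss, and the penalty would be added to that loss with some weight; both of those wiring choices are left out of the sketch.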

Paper Structure

This paper contains 31 sections, 12 equations, 6 figures, 4 tables.

Figures (6)

  • Figure 1: Motivation of SPINE. (a) TTRL: sample multiple responses, majority vote forms a pseudo-label, then update with GRPO. (b) TTRL is unstable with shrinking outputs. (c) Entropy is skewed; the top 20% high-entropy tokens mark forking decisions. (d) SPINE updates only forking tokens and applies an entropy band, stabilizing adaptation and mitigating overfitting and forgetting.
  • Figure 2: SPINE pipeline. The model samples responses, majority voting produces a pseudo-label, and rewards are assigned. Gradients update only forking tokens, while flowing tokens are frozen. An entropy band further stabilizes training and preserves reasoning diversity.
  • Figure 3: Hyperparameter sensitivity on MMLU: (a–b) show the effect of varying the lower and upper quantiles of the entropy band, while (c) shows the effect of varying the forking-token ratio.
  • Figure 4: Sensitivity to scaling-related hyperparameters, with larger rollout $N$ and longer response lengths providing consistent improvements on the challenging AIME 2025.
  • Figure 5: Training dynamics comparison between SPINE (blue) and TTRL (red) on GPQA. (a) Majority-vote reward, (b) response length, (c) mean token entropy, and (d) Pass@1 accuracy.
  • ...and 1 more figures
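
The label-free reward step described in the captions of Figure 1(a) and Figure 2 (sample several responses, take the majority answer as a pseudo-label, assign rewards) can be sketched as follows. This is an illustrative sketch only: the binary 0/1 reward, the function names, and the mean/std group normalization (a common GRPO-style choice) are assumptions, not details confirmed by this page.

```python
from collections import Counter
import statistics

def majority_vote_rewards(answers):
    """Return (pseudo_label, rewards). A response gets reward 1.0 when its
    final answer matches the most frequent answer in the group, else 0.0."""
    pseudo_label, _ = Counter(answers).most_common(1)[0]
    rewards = [1.0 if a == pseudo_label else 0.0 for a in answers]
    return pseudo_label, rewards

def group_normalized_advantages(rewards, eps=1e-6):
    """GRPO-style advantage (assumed form): reward centred and scaled by the
    statistics of its own sampling group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Usage: four sampled answers to one unlabeled test question.
label, rewards = majority_vote_rewards(["42", "42", "7", "42"])
advantages = group_normalized_advantages(rewards)
```

When every sampled answer agrees, all rewards are equal and the advantages are zero, so that question contributes no gradient under this assumed normalization.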