SPINE: Token-Selective Test-Time Reinforcement Learning with Entropy-Band Regularization
Jianghao Wu, Yasmeen George, Jin Ye, Yicheng Wu, Daniel F. Schmidt, Jianfei Cai
TL;DR
SPINE tackles test-time distribution shift in autoregressive reasoning models by learning selectively at decision points rather than updating all tokens. It identifies fork points via token entropy and applies an entropy-band regularizer to maintain exploration without overfitting to noisy pseudo-rewards, all within a GRPO-based, label-free framework. Across ten benchmarks spanning multimodal VQA, mathematical reasoning, and medical/general QA, SPINE consistently outperforms standard TTRL and no-adaptation baselines, while avoiding response-length collapse and entropy drift. Training dynamics remain stable on both LLMs and MLLMs, making SPINE a simple, label-free mechanism for test-time adaptation in complex reasoning tasks.
Abstract
Large language models (LLMs) and multimodal LLMs (MLLMs) excel at chain-of-thought reasoning but face distribution shift at test time and a lack of verifiable supervision. Recent test-time reinforcement learning (TTRL) methods derive label-free pseudo-rewards from self-consistency voting over sampled trajectories, yet they often collapse: the majority-vote reward prevails, responses shorten, and Pass@1 declines. We trace this to uniform sequence updates in which most tokens are low-entropy followers, while a small high-entropy subset determines the reasoning branches. We therefore propose SPINE, a token-selective test-time reinforcement learning framework that (i) updates only forking tokens, the high-entropy branch points identified from forward-pass statistics, and (ii) applies an entropy-band regularizer at those tokens to sustain exploration when entropy is too low and to suppress noisy supervision when it is too high. SPINE plugs into GRPO-style objectives, optionally with a KL anchor, and requires no labels or reward models. Across ten benchmarks spanning multimodal VQA, general and expert QA, mathematical reasoning, and medical QA, SPINE consistently improves Pass@1 over TTRL while avoiding response-length collapse and yielding more stable training dynamics on both LLM and MLLM backbones. These results indicate that aligning updates with chain-of-thought branch points is a simple and label-free mechanism for stable and effective test-time adaptation in reasoning models. Code is available at https://github.com/JianghaoWu/SPINE.
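
The three ingredients named in the abstract (majority-vote pseudo-rewards, entropy-based selection of forking tokens, and an entropy band at those tokens) can be illustrated with a minimal NumPy sketch. This is not the released implementation. The function names, the top-fraction selection rule, the hinge form of the band penalty, and the threshold values are all assumptions made for illustration; the abstract does not specify them.

```python
from collections import Counter

import numpy as np


def token_entropy(logits):
    """Shannon entropy (in nats) of the next-token distribution at each position.

    logits: array of shape (T, V), one row per generated token.
    """
    z = logits - logits.max(axis=-1, keepdims=True)  # shift for a stable softmax
    p = np.exp(z)
    p /= p.sum(axis=-1, keepdims=True)
    return -(p * np.log(p + 1e-12)).sum(axis=-1)


def forking_mask(entropy, top_frac=0.2):
    """Flag the highest-entropy positions as forking tokens.

    ASSUMPTION: a top-fraction rule; the actual selection rule may differ.
    """
    k = max(1, int(round(top_frac * entropy.shape[0])))
    mask = np.zeros(entropy.shape[0], dtype=bool)
    mask[np.argsort(entropy)[-k:]] = True
    return mask


def entropy_band_penalty(entropy, mask, low, high):
    """Mean hinge penalty over forking tokens whose entropy leaves [low, high].

    ASSUMPTION: a hinge form with hypothetical thresholds `low` and `high`.
    """
    below = np.maximum(low - entropy, 0.0)   # too confident: exploration is lost
    above = np.maximum(entropy - high, 0.0)  # too uncertain: supervision is noisy
    return float(((below + above) * mask).sum() / max(mask.sum(), 1))


def majority_vote_rewards(answers):
    """Label-free pseudo-reward: 1.0 when a sampled answer equals the majority answer."""
    majority, _ = Counter(answers).most_common(1)[0]
    return [1.0 if a == majority else 0.0 for a in answers]
```

In a GRPO-style update under these assumptions, the policy-gradient term would be evaluated only where `forking_mask` is true, with `entropy_band_penalty` added as a weighted regularizer; the weighting and the optional KL anchor are omitted here.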
