TARo: Token-level Adaptive Routing for LLM Test-time Alignment

Arushi Rai; Qiang Zhang; Hanqing Zeng; Yunkai Zhang; Dipesh Tamboli; Xiangjun Fan; Zhuokai Zhao; Lizhu Zhang

Abstract

Large language models (LLMs) exhibit strong reasoning capabilities but typically require expensive post-training to reach high performance. Recent test-time alignment methods offer a lightweight alternative, but they have been explored mainly for preference alignment rather than reasoning. To bridge this gap, we propose Token-level Adaptive Routing (TARo), which steers frozen LLMs toward structured reasoning entirely at inference time. Specifically, we first train reward models on step-wise mathematical traces to capture fine-grained logical consistency signals, then introduce a learnable token-level router that automatically controls how strongly the reward model guides the base model at each token. Extensive experiments show that TARo improves reasoning performance by up to +22.4% over the base model and +8.4% over existing token-level test-time alignment methods, while also boosting out-of-distribution clinical reasoning (MedXpertQA) and instruction following (AlpacaEval). TARo also generalizes from small to large backbones without retraining, extending test-time alignment from preference optimization to robust, cross-domain reasoning.

Paper Structure (39 sections, 12 equations, 3 figures, 11 tables)

This paper contains 39 sections, 12 equations, 3 figures, 11 tables.

Introduction
Related Work
Test-time alignment.
Post-training methods for reasoning.
Mixture of Experts.
Method
Preliminaries
Reasoning Reward LLM
Learnable Token-level Router
Full-logits concatenation.
Top-$k$ logits with index embedding.
Router design.
Final guided decoding.
Experiment
Experimental Setup
...and 24 more sections

Figures (3)

Figure 1: Performance on MATH500 (accuracy) and AlpacaEval (length-controlled win rate) for the state-of-the-art test-time alignment approach (GenARM) under different mixing coefficients $\alpha \in [0,1]$. Setting $\alpha=0$ corresponds to decoding solely from the base model, while $\alpha=1$ uses only the reward model.
Figure 2: Learnable token-level router design. At each LLM decoding step $t$, the base and reward models produce logits $z^{\text{base}}_t$ and $z^{\text{reward}}_t$. The logits are passed as input to Feature Concat, which either (i) concatenates the logits, or (ii) concatenates the logits plus learnable token-index embeddings (as discussed in the Learnable Token-level Router section). The router consumes the concatenated feature and outputs a routing weight $\alpha_t \in (0,1)$. The guided distribution $(1-\alpha_t)\,z^{\text{base}}_t+\alpha_t\,z^{\text{reward}}_t$ is then used to sample the next token. This design makes the router portable across base model scales and families.
Figure 3: Weak-to-strong generalization of the learned router on reasoning. The learned router and the reasoning reward model are not retrained for this scale.
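
The decoding step described in the Figure 2 caption can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the router here is a hypothetical single linear layer followed by a sigmoid over the full-logits concatenation (variant (i)), its `weights` and `bias` are untrained placeholder values, and the function names are invented for this example. Only the mixing rule $(1-\alpha_t)\,z^{\text{base}}_t+\alpha_t\,z^{\text{reward}}_t$ and the range $\alpha_t \in (0,1)$ come from the caption.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def route_weight(z_base, z_reward, weights, bias):
    """Hypothetical router: sigmoid of a linear map over concatenated logits.

    The real router architecture is not specified in this section; a single
    linear layer is an assumption made only to keep the sketch short.
    """
    features = z_base + z_reward  # full-logits concatenation, variant (i)
    score = bias + sum(w * f for w, f in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-score))  # lies in (0, 1) for moderate scores


def guided_logits(z_base, z_reward, alpha):
    """Mix logits as (1 - alpha) * z_base + alpha * z_reward."""
    return [(1.0 - alpha) * b + alpha * r for b, r in zip(z_base, z_reward)]


def sample_next_token(z_base, z_reward, weights, bias, rng):
    """One guided decoding step: route, mix, normalize, sample."""
    alpha = route_weight(z_base, z_reward, weights, bias)
    probs = softmax(guided_logits(z_base, z_reward, alpha))
    token = rng.choices(range(len(probs)), weights=probs, k=1)[0]
    return token, alpha, probs


# Toy three-token vocabulary with made-up logits and placeholder weights.
z_base = [2.0, 0.5, -1.0]
z_reward = [0.0, 3.0, -1.0]
weights = [0.1] * 6
bias = 0.0

token, alpha, probs = sample_next_token(
    z_base, z_reward, weights, bias, random.Random(0)
)
```

Fixing `alpha` to a constant instead of calling `route_weight` recovers the fixed mixing coefficient of Figure 1: `alpha = 0` reproduces the base model's logits and `alpha = 1` reproduces the reward model's.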
