Beyond Test-Time Training: Learning to Reason via Hardware-Efficient Optimal Control

Peihao Wang; Shan Yang; Xijun Wang; Tesi Xiao; Xin Liu; Changlong Yu; Yu Lou; Pan Li; Zhangyang Wang; Ming Lin; René Vidal

Abstract

Associative memory has long underpinned the design of sequential models. Beyond recall, humans reason by projecting future states and selecting goal-directed actions, a capability that modern language models increasingly require but do not natively encode. While prior work uses reinforcement learning or test-time training, planning remains external to the model architecture. We formulate reasoning as optimal control and introduce the Test-Time Control (TTC) layer, which performs finite-horizon LQR planning over latent states at inference time, represents a value function within neural architectures, and leverages it as the nested objective to enable planning before prediction. To ensure scalability, we derive a hardware-efficient LQR solver based on a symplectic formulation and implement it as a fused CUDA kernel, enabling parallel execution with minimal overhead. Integrated as an adapter into pretrained LLMs, TTC layers improve mathematical reasoning performance by up to +27.8% on MATH-500 and yield 2-3x Pass@8 improvements on AMC and AIME, demonstrating that embedding optimal control as an architectural component provides an effective and scalable mechanism for reasoning beyond test-time training.

Paper Structure (55 sections, 8 theorems, 62 equations, 5 figures, 4 tables, 6 algorithms)

Introduction
Main Contributions
Background: Memory-Based Architectures
More Memory-Based Architectures.
A Planning-Based Neural Architecture
Learning to Reinforce Learning at Test Time
Differentiating TTC Layers
Hardware Co-Design for TTC
Hardware-Efficient LQR Solver.
Structured Parameterization.
Kernel Fusion.
Caching for Backward.
Empirical Validation.
TTC-Net: A Hybrid Model with TTC Layer
Contextualization.
...and 40 more sections

Key Result

Proposition 3.1

The optimal $\boldsymbol{u}_{1}^*$ minimizing the TTC objective depends linearly on $\boldsymbol{h}_{0}$ as $\boldsymbol{u}_{1}^* = \boldsymbol{K}^{*}_{1}\boldsymbol{h}_{0}$, where the gain $\boldsymbol{K}^{*}_{1}$ is obtained from the Riccati iteration over $t = 1, \cdots, T$ with terminal condition $\boldsymbol{P}_{T} = \boldsymbol{Q}_{T}$.
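To make the structure of this result concrete, the following is a minimal sketch of a backward Riccati pass for a scalar finite-horizon LQR problem. It is not the paper's solver: the dynamics $h_t = a h_{t-1} + b u_t$, the cost $\sum_{t=1}^{T} (q_t h_t^2 + r u_t^2)$ with $q_t = q$ for $t < T$ and terminal weight $q_T$, and the function names `riccati_gains`, `optimal_controls`, and `rollout_cost` are all assumptions chosen for illustration. The sketch only shows how a terminal condition $P_T = Q_T$ propagates backward to a first-step gain, so that $u_1^*$ is linear in $h_0$.

```python
def riccati_gains(a, b, q, r, q_T, T):
    """Backward Riccati pass for an assumed scalar LQR problem.

    Returns the gains [k_1, ..., k_T], where u_t* = k_t * h_{t-1}.
    """
    p = q_T                                  # terminal condition P_T = Q_T
    gains = [0.0] * T
    for t in range(T, 0, -1):                # t = T, T-1, ..., 1
        k = -(b * p * a) / (r + b * p * b)   # minimizer of r*u^2 + p*(a*h + b*u)^2
        gains[t - 1] = k
        p = q + a * p * (a + b * k)          # cost-to-go weight P_{t-1}
    return gains


def optimal_controls(a, b, gains, h0):
    """Roll the closed loop forward to get the control sequence u_1, ..., u_T."""
    h, controls = h0, []
    for k in gains:
        u = k * h
        controls.append(u)
        h = a * h + b * u
    return controls


def rollout_cost(a, b, q, r, q_T, h0, controls):
    """Total cost of a control sequence under the assumed dynamics and cost."""
    h, cost = h0, 0.0
    T = len(controls)
    for t, u in enumerate(controls, start=1):
        h = a * h + b * u
        weight = q_T if t == T else q        # terminal step uses q_T
        cost += weight * h * h + r * u * u
    return cost


# Hypothetical parameters, chosen only for illustration.
gains = riccati_gains(a=1.2, b=0.5, q=1.0, r=0.1, q_T=2.0, T=5)
controls = optimal_controls(1.2, 0.5, gains, h0=1.0)
print("first-step gain k_1:", gains[0])
print("planned cost:", rollout_cost(1.2, 0.5, 1.0, 0.1, 2.0, 1.0, controls))
```

In this sketch only `gains[0]` is needed to produce the first action, and each backward step depends on the previous one, which makes the pass sequential in the horizon $T$.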

Figures (5)

Figure 1: Memory-only prediction vs. unified memory-control planning. Memory-based models resemble human System 1 processing, relying on associative retrieval and producing fixed responses. Inspired by System 2 cognition, we model reasoning as an explicit optimal control problem and propose TTC-Net, a unified architecture that integrates test-time control (TTC) layers to encode value functions during sequential modeling, enabling planning before prediction.
Figure 2: Test-Time Control (TTC) layer. To predict the next token, TTC layers think through a receding-horizon LQR problem with linear state evolution and a cost function. A hardware-efficient solver computes optimal actions using structured matrix operations, enabling test-time control with minimal inference overhead.
Figure 3: Benchmarking running speed and memory of different LQR solvers. Throughput is reported in TFLOP/s on a logarithmic scale. Among all evaluated solvers, our method achieves over 10x higher throughput while maintaining constant memory cost w.r.t. the horizon. Zero throughput indicates an out-of-memory error during execution.
Figure 4: Overview of TTC-Net. We construct a hybrid model by inserting a TTC layer between attention and MLP.
Figure 5: Test-time scaling with TTC layers. TTC allows scaling test-time compute to improve performance by enlarging the planning horizon $T$.

Theorems & Definitions (17)

Proposition 3.1: Riccati Iteration
Theorem 3.2
Theorem 3.3: Symplectic Iteration
Definition A.1: Linear-Quadratic Regulator (Kalman, 1960)
Proposition A.2
proof
Remark A.3
Theorem A.4: Forward symplectic iteration
proof
Theorem A.5: Reverse symplectic iteration
...and 7 more
