Learning to Reason from Feedback at Test-Time
Yanyang Li, Michael Lyu, Liwei Wang
TL;DR
The paper addresses how to exploit test-time feedback on complex reasoning tasks. It introduces Feedback-based Test-Time Training (FTTT), which stores knowledge in the model weights and uses a binary verifier, plus optional self-reflection, to guide learning. FTTT is paired with OpTune, a lightweight optimizer that operates in gradient space and predicts weight updates from recent attempts, so that test-time optimization scales with minimal parameter overhead. Across math and coding datasets, FTTT improves test-time scalability and, when combined with OpTune, outperforms common parameter-efficient fine-tuning (PEFT) baselines while remaining efficient. Extending the approach to continuous feedback settings is left as a possible direction for future work.
Abstract
Solving complex tasks in a single attempt is challenging for large language models (LLMs). Iterative interaction with the environment and feedback is often required to achieve success, making effective feedback utilization a critical topic. Existing approaches either struggle with length generalization or rely on naive retries without leveraging prior information. In this paper, we introduce Feedback-based Test-Time Training (FTTT), a novel paradigm that formulates feedback utilization as an optimization problem at test time. Additionally, we propose a learnable test-time optimizer, OpTune, to effectively exploit feedback. Experiments on two LLMs across four reasoning datasets demonstrate that FTTT and OpTune achieve superior scalability and performance.
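The idea of treating feedback as a test-time optimization signal, rather than retrying blindly, can be illustrated with a toy sketch. This is not the authors' implementation: the "model" here is a categorical distribution over a few candidate answers, its logits stand in for model weights, and the update rule (a gradient step that lowers the log-probability of a failed attempt) is an assumption chosen for illustration. The names `verifier` and `solve_with_feedback` are hypothetical, and self-reflection and OpTune are not modeled.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def verifier(answer, target):
    """Binary feedback: True if the attempt is correct, False otherwise."""
    return answer == target


def solve_with_feedback(num_candidates, target, budget, lr=1.0, seed=0):
    """Sample attempts; after each failure, update the toy weights.

    Assumption (not from the paper): a failed attempt triggers one gradient
    step that decreases log p(failed answer), so the information from the
    failure is stored in the logits instead of being discarded.
    """
    rng = random.Random(seed)
    logits = [0.0] * num_candidates  # toy stand-in for model weights
    for attempt in range(1, budget + 1):
        probs = softmax(logits)
        answer = rng.choices(range(num_candidates), weights=probs)[0]
        if verifier(answer, target):
            return answer, attempt, logits
        # d/d logit_j of log p(answer) = 1[j == answer] - p_j.
        # Stepping against this gradient lowers p(answer).
        for j in range(num_candidates):
            grad = (1.0 if j == answer else 0.0) - probs[j]
            logits[j] -= lr * grad
    return None, budget, logits


if __name__ == "__main__":
    answer, attempts, logits = solve_with_feedback(
        num_candidates=5, target=3, budget=200
    )
    print("answer:", answer, "attempts:", attempts)
```

In this sketch, each failure shifts probability mass away from answers already shown to be wrong, which is the contrast with naive retries that the abstract draws. A real system would replace the logits with LLM parameters (or a small set of trainable parameters) and the exact-match check with a task-specific verifier.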
