Learning to Reason from Feedback at Test-Time

Yanyang Li, Michael Lyu, Liwei Wang

TL;DR

The paper tackles the challenge of exploiting test-time feedback for complex reasoning tasks by introducing Feedback-based Test-Time Training (FTTT), which stores knowledge in model weights and uses a binary verifier plus optional self-reflection to guide learning. It couples FTTT with OpTune, a lightweight gradient-space optimizer that predicts weight updates from recent attempts, enabling scalable test-time optimization with minimal parameter overhead. Empirical results across math and coding datasets show that FTTT improves test-time scalability and, when integrated with OpTune, outperforms common PEFT baselines while maintaining efficiency. The work advances practical, memory-efficient test-time adaptation for large language models, with potential extensions to continuous feedback settings in future work.

Abstract

Solving complex tasks in a single attempt is challenging for large language models (LLMs). Iterative interaction with the environment and feedback is often required to achieve success, making effective feedback utilization a critical topic. Existing approaches either struggle with length generalization or rely on naive retries without leveraging prior information. In this paper, we introduce FTTT, a novel paradigm that formulates feedback utilization as an optimization problem at test time. Additionally, we propose a learnable test-time optimizer, OpTune, to effectively exploit feedback. Experiments on two LLMs across four reasoning datasets demonstrate that FTTT and OpTune achieve superior scalability and performance.

Paper Structure

This paper contains 30 sections, 7 equations, 5 figures, 6 tables, 1 algorithm.

Figures (5)

  • Figure 1: Comparison between sequential revision, parallel sampling, and feedback-based test-time training. Separate markers distinguish failed attempts from successful ones. Generation arrows show the LLM input on the left and the output on the right; training arrows show the training data on the left.
  • Figure 2: The model architecture of OpTune.
  • Figure 3: The scaling trends of different methods under varying budgets. The shaded area around each line denotes the standard deviation. The first row shows results for Llama-3.1-8B-Instruct and the second row for Mistral-7B-Instruct-v0.3.
  • Figure 4: The scaling trends of different fine-tuning methods under varying budgets. We report the mean results of three random trials. The first row shows results for Llama-3.1-8B-Instruct and the second row for Mistral-7B-Instruct-v0.3.
  • Figure 5: The training curves of PEFT methods when fine-tuning Llama-3.1-8B-Instruct on MBPP.