Continuous Self-Improvement of Large Language Models by Test-time Training with Verifier-Driven Sample Selection

Mohammad Mahdi Moradi; Hossam Amer; Sudhir Mudur; Weiwei Zhang; Yang Liu; Walid Ahmed

TL;DR

The paper tackles the challenge of adapting large language models at test time to unlabeled, out-of-distribution data. It introduces VDS-TTT, a verifier-driven framework that generates multiple candidate responses per input, uses a learned verifier to select high-confidence pseudo-labels, and fine-tunes only LoRA adapters for efficient, continuous self-improvement. Across the GSM8K, Math-500, and AIME benchmarks with three state-of-the-art LLMs, VDS-TTT delivers a relative improvement of up to 32.29% over the base model and 6.66% over verifier-based methods without test-time training, demonstrating robust adaptation under distribution shift. The approach emphasizes practicality, achieving strong performance with modest test-time compute and few parameter updates, and the paper outlines limitations and directions for broader verifier applicability.

Abstract

Adapting pretrained language models to unlabeled, out-of-distribution data is a critical challenge, as models often falter on structurally novel reasoning tasks even while excelling within their training distribution. To address this efficiently, we introduce VDS-TTT (Verifier-Driven Sample Selection for Test-Time Training). We use a learned verifier to score a pool of generated responses and select only high-ranking pseudo-labeled examples for fine-tuning. Specifically, for each input query the LLM generates N candidate answers; the verifier assigns a reliability score to each, and the highest-scoring response, provided its score is above a fixed threshold, is paired with its query for test-time training. We fine-tune only the low-rank LoRA adapter parameters, ensuring efficient adaptation and fast convergence. Our proposed self-supervised framework is the first to synthesize verifier-driven test-time training data for continuous self-improvement of the model. Experiments across three diverse benchmarks and three state-of-the-art LLMs demonstrate that VDS-TTT yields up to a 32.29% relative improvement over the base model and a 6.66% gain over verifier-based methods without test-time training, highlighting its effectiveness and efficiency for on-the-fly large language model adaptation.
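
The selection step described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the names `select_pseudo_label` and `build_ttt_dataset`, the `generate` and `verify` callables, and the toy generator and verifier are all hypothetical stand-ins, and treating "above a fixed threshold" as a strict inequality is an assumption. The LoRA fine-tuning step itself is left out.

```python
import itertools
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class PseudoLabel:
    query: str
    response: str
    score: float


def select_pseudo_label(
    query: str,
    generate: Callable[[str], str],        # hypothetical: samples one answer from the LLM
    verify: Callable[[str, str], float],   # hypothetical: verifier reliability score
    n_candidates: int,
    threshold: float,
) -> Optional[PseudoLabel]:
    """Return the top-scoring candidate if its score is above the threshold."""
    if n_candidates < 1:
        raise ValueError("n_candidates must be at least 1")
    # Generate N candidate answers and score each one with the verifier.
    candidates = [generate(query) for _ in range(n_candidates)]
    scored = [(verify(query, c), c) for c in candidates]
    best_score, best_response = max(scored, key=lambda pair: pair[0])
    # Assumption: "above a fixed threshold" is read as a strict inequality.
    if best_score > threshold:
        return PseudoLabel(query, best_response, best_score)
    return None


def build_ttt_dataset(
    queries: List[str],
    generate: Callable[[str], str],
    verify: Callable[[str, str], float],
    n_candidates: int,
    threshold: float,
) -> List[PseudoLabel]:
    """Collect (query, response) pairs that would feed test-time training."""
    selected = []
    for q in queries:
        label = select_pseudo_label(q, generate, verify, n_candidates, threshold)
        if label is not None:
            selected.append(label)
    return selected


# Toy stand-ins for the LLM and the verifier (illustration only).
_canned = itertools.cycle(["42", "41", "40"])


def toy_generate(query: str) -> str:
    return next(_canned)


def toy_verify(query: str, response: str) -> float:
    return {"42": 0.9, "41": 0.6, "40": 0.2}[response]


dataset = build_ttt_dataset(
    ["What is 6 * 7?"], toy_generate, toy_verify, n_candidates=3, threshold=0.8
)
# In the full pipeline, `dataset` would be used to fine-tune the LoRA adapters only.
```

Queries whose best candidate does not clear the threshold contribute no training pair in this sketch, so only pseudo-labels the verifier rates as reliable reach the adapter update.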
