Continuous Self-Improvement of Large Language Models by Test-time Training with Verifier-Driven Sample Selection

Mohammad Mahdi Moradi, Hossam Amer, Sudhir Mudur, Weiwei Zhang, Yang Liu, Walid Ahmed

TL;DR

The paper tackles the challenge of adapting large language models at test time to unlabeled, out-of-distribution data. It introduces VDS-TTT, a verifier-driven framework that generates multiple candidate responses per input, uses a learned verifier to select high-confidence pseudo-labels, and fine-tunes only low-rank LoRA adapters to achieve efficient, ongoing self-improvement. Across GSM8K, Math-500, and AIME benchmarks with three state-of-the-art LLMs, VDS-TTT delivers up to 32.29% relative improvement over the base model and 6.66% over verifier-based methods without test-time training, demonstrating robust adaptation under distribution shifts. The approach emphasizes practicality, achieving strong performance with modest test-time compute and parameter updates, while outlining limitations and directions for broader verifier applicability.

Abstract

Adapting pretrained language models to unlabeled, out-of-distribution data is a critical challenge, as models often falter on structurally novel reasoning tasks even while excelling within their training distribution. We introduce VDS-TTT (Verifier-Driven Sample Selection for Test-Time Training), a framework that addresses this challenge efficiently. We use a learned verifier to score a pool of generated responses and select only high-ranking pseudo-labeled examples for fine-tuning. Specifically, for each input query our LLM generates N candidate answers; the verifier assigns a reliability score to each, and the highest-scoring response is paired with its query for test-time training, provided its score is above a fixed threshold. We fine-tune only low-rank LoRA adapter parameters, ensuring adaptation efficiency and fast convergence. Our proposed self-supervised framework is the first to synthesize verifier-driven test-time training data for continuous self-improvement of the model. Experiments across three diverse benchmarks and three state-of-the-art LLMs demonstrate that VDS-TTT yields up to a 32.29% relative improvement over the base model and a 6.66% gain compared to verifier-based methods without test-time training, highlighting its effectiveness and efficiency for on-the-fly large language model adaptation.

Paper Structure

This paper contains 11 sections, 2 equations, 3 figures, 2 tables, 1 algorithm.

Figures (3)

  • Figure 1: In our VDS-TTT framework, we proceed through three sequential stages. First, Candidate Generation for Self-Annotation, where each input question $Q_i$ is passed to the pretrained LLM to produce a set of $N$ candidate responses $\{r_1,\ldots,r_N\}$. Second, Confidence-Guided Annotation, in which a verifier assigns a reliability score to each $r_j$, and we select the response $r^*$ only if its score exceeds a predefined threshold $\tau$, thereby forming the test-time training example $(Q_i, r^*)$. Finally, Test-Time Training, where we fine-tune the model by optimizing its parameters on the resulting pseudo-labeled dataset.
  • Figure 2: Three instances of TTT loss curves
  • Figure 3: Iterative VDS-TTT results
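
The first two stages described in the Figure 1 caption (candidate generation and confidence-guided annotation) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names `select_best` and `build_ttt_dataset`, the toy generator and verifier, and the values `n=3` and `tau=0.5` are all hypothetical stand-ins chosen for illustration. The third stage, LoRA fine-tuning on the selected pairs, is omitted here.

```python
from typing import Callable, List, Optional, Tuple

# Hypothetical type aliases: a generator maps (question, n) to n candidate
# responses, and a verifier maps (question, response) to a reliability score.
Generator = Callable[[str, int], List[str]]
Verifier = Callable[[str, str], float]


def select_best(question: str, candidates: List[str],
                score: Verifier, tau: float) -> Optional[str]:
    """Return the highest-scoring candidate r*, or None if its score
    does not exceed the threshold tau (or there are no candidates)."""
    if not candidates:
        return None
    scored = [(score(question, r), r) for r in candidates]
    best_score, best = max(scored, key=lambda pair: pair[0])
    return best if best_score > tau else None


def build_ttt_dataset(questions: List[str], generate: Generator,
                      score: Verifier, n: int,
                      tau: float) -> List[Tuple[str, str]]:
    """Build pseudo-labeled pairs (Q_i, r*) for test-time training.
    Questions with no candidate above tau are skipped."""
    dataset = []
    for q in questions:
        best = select_best(q, generate(q, n), score, tau)
        if best is not None:
            dataset.append((q, best))
    return dataset


# Toy stand-ins for the LLM and the learned verifier (assumptions only).
def toy_generate(question: str, n: int) -> List[str]:
    pool = {"2+2?": ["4", "5", "22"], "7*6?": ["41", "43", "48"]}
    return pool[question][:n]


def toy_score(question: str, response: str) -> float:
    table = {
        ("2+2?", "4"): 0.95, ("2+2?", "5"): 0.10, ("2+2?", "22"): 0.05,
        ("7*6?", "41"): 0.30, ("7*6?", "43"): 0.20, ("7*6?", "48"): 0.25,
    }
    return table[(question, response)]


dataset = build_ttt_dataset(["2+2?", "7*6?"], toy_generate, toy_score,
                            n=3, tau=0.5)
print(dataset)  # [('2+2?', '4')]
```

In this toy run the second question contributes no training example, because none of its candidates scores above the threshold. The threshold thus trades the size of the pseudo-labeled set against label reliability.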