Learning on the Job: Test-Time Curricula for Targeted Reinforcement Learning

Jonas Hübotter; Leander Diaz-Bone; Ido Hakimi; Andreas Krause; Moritz Hardt

TL;DR

This work introduces TTC-RL, a framework for continual, test-time improvement of LLMs. It automatically constructs a task-focused curriculum from a large, diverse corpus and trains with on-policy RL to specialize the model toward a target task. The pipeline uses SIFT for data selection and GRPO for learning, which enables efficient, targeted practice. It yields substantial performance gains across math, coding, and scientific reasoning benchmarks, and it raises the model’s performance ceiling beyond initial context constraints. A diverse, verifiable corpus supports post-training generalization across domains, and the authors propose latent improvement as a metric that quantifies genuine reasoning gains rather than merely better formatting. The results show that TTC-RL complements existing test-time scaling methods and specializes the model to its target tasks, and the authors outline directions for continual self-improvement and robust data handling in future work.
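To make the GRPO step mentioned above more concrete, here is a minimal sketch of the group-relative advantage that GRPO is generally built around: each sampled answer's reward is normalized against the other answers sampled for the same task. The function name, the `eps` constant, the use of the population standard deviation, and the example rewards are assumptions of this sketch, not details taken from the paper, and SIFT-based selection is not shown.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each rollout's reward against its own group.

    `rewards` holds one scalar reward per sampled answer to the SAME task.
    `eps` (an assumption of this sketch) avoids division by zero when all
    rewards in the group are identical.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; a choice made for this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]

# Hypothetical verifier outcomes (1.0 = correct, 0.0 = incorrect) for
# four sampled answers to one curriculum task.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)

# A group where every answer gets the same reward carries no learning signal.
flat = group_relative_advantages([1.0, 1.0, 1.0])
```

Correct answers receive a positive advantage and incorrect ones a negative advantage, while a task on which all samples agree contributes nothing. This is one intuition for why practicing on tasks of suitable difficulty matters for a curriculum.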

Abstract

Humans are good at learning on the job: We learn how to solve the tasks we face as we go along. Can a model do the same? We propose an agent that assembles a task-specific curriculum, called test-time curriculum (TTC-RL), and applies reinforcement learning to continue training the model for its target task. The test-time curriculum avoids time-consuming human curation of datasets by automatically selecting the most task-relevant data from a large pool of available training data. Our experiments demonstrate that reinforcement learning on a test-time curriculum consistently improves the model on its target tasks, across a variety of evaluations and models. Notably, on challenging math and coding benchmarks, TTC-RL improves the pass@1 of Qwen3-8B by approximately 1.8x on AIME25 and 2.1x on CodeElo. Moreover, we find that TTC-RL significantly raises the performance ceiling compared to the initial model, increasing pass@8 on AIME25 from 40% to 62% and on CodeElo from 28% to 43%. Our findings show the potential of test-time curricula in extending the test-time scaling paradigm to continual training on thousands of task-relevant experiences during test-time.
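The abstract reports results as pass@1 and pass@8. As a reading aid, the following sketch shows a commonly used unbiased estimator of pass@k from `n` sampled answers of which `c` are correct. The abstract does not state how pass@k was computed, so treating this estimator as the one used is an assumption, and the sample counts in the example are hypothetical.

```python
from math import comb

def pass_at_k(n, c, k):
    """Estimate the probability that at least one of k samples is correct.

    n: total answers sampled for a problem
    c: number of those answers that are correct
    k: evaluation budget (k <= n)
    """
    if n - c < k:
        # Fewer than k incorrect answers exist, so any k-subset
        # must contain at least one correct answer.
        return 1.0
    # 1 minus the probability that a random k-subset is all incorrect.
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical problem: 10 sampled answers, 3 of them correct.
p1 = pass_at_k(10, 3, 1)
p8 = pass_at_k(10, 3, 8)
```

Averaging this quantity over all problems in a benchmark gives the benchmark-level pass@k. With k = 1 it reduces to the fraction of correct samples, which is why pass@1 reflects typical performance while pass@8 is read as a performance ceiling.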
