Learning on the Job: Test-Time Curricula for Targeted Reinforcement Learning

Jonas Hübotter, Leander Diaz-Bone, Ido Hakimi, Andreas Krause, Moritz Hardt

TL;DR

This work introduces TTC-RL, a framework for continual, test-time improvement of LLMs by automatically constructing a task-focused curriculum from a large, diverse corpus and training with on-policy RL to specialize toward a target task. The pipeline leverages SIFT for data selection and GRPO for learning, enabling efficient, targeted practice and yielding substantial performance gains across math, coding, and scientific reasoning benchmarks, while also raising the model’s performance ceiling beyond initial context constraints. A diverse verifiable corpus supports post-training generalization across domains, and the authors propose latent improvement as a metric to quantify genuine reasoning enhancements beyond simply better formatting. The results demonstrate complementary interactions with existing test-time scaling methods, show specialization to target tasks, and outline directions for continual self-improvement and robust data handling in future work.
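The learning step named above, GRPO, scores each sampled attempt relative to the other attempts drawn for the same problem. The following is a minimal sketch of that group-relative advantage, assuming binary verifier rewards; the function name `group_relative_advantages`, the `eps` constant, and the use of the population standard deviation are illustrative assumptions and are not taken from the paper. The SIFT selection step is not sketched here.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of attempts at the same problem.

    Each advantage is (reward - group mean) / (group std + eps).
    The name, eps, and the choice of population std are assumptions
    made for illustration only.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four attempts at one curriculum problem, two verified correct (reward 1.0).
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Successful attempts receive positive advantages and failed ones negative advantages. When every attempt in a group gets the same reward, all advantages are zero, so a problem that is always solved or never solved contributes no learning signal in this sketch.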

Abstract

Humans are good at learning on the job: We learn how to solve the tasks we face as we go along. Can a model do the same? We propose an agent that assembles a task-specific curriculum, called a test-time curriculum (TTC-RL), and applies reinforcement learning to continue training the model for its target task. The test-time curriculum avoids time-consuming human curation of datasets by automatically selecting the most task-relevant data from a large pool of available training data. Our experiments demonstrate that reinforcement learning on a test-time curriculum consistently improves the model on its target tasks, across a variety of evaluations and models. Notably, on challenging math and coding benchmarks, TTC-RL improves the pass@1 of Qwen3-8B by approximately 1.8x on AIME25 and 2.1x on CodeElo. Moreover, we find that TTC-RL significantly raises the performance ceiling compared to the initial model, increasing pass@8 on AIME25 from 40% to 62% and on CodeElo from 28% to 43%. Our findings show the potential of test-time curricula in extending the test-time scaling paradigm to continual training on thousands of task-relevant experiences during test-time.
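The pass@1 and pass@8 figures quoted above are instances of the pass@k metric: the probability that at least one of k sampled attempts is correct. The abstract does not say how the metric is estimated; the sketch below uses the commonly used unbiased estimator from n samples with c correct, and the function name `pass_at_k` is an assumption made for illustration.

```python
from math import comb


def pass_at_k(n, c, k):
    """Estimate pass@k from n sampled attempts, c of which are correct.

    Computes 1 - C(n - c, k) / C(n, k): one minus the probability that
    a random subset of k attempts contains no correct one.
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    if n - c < k:
        # Fewer than k incorrect attempts: every subset of size k has a hit.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# Hypothetical counts: 16 attempts at one problem, 4 of them correct.
p1 = pass_at_k(16, 4, 1)
p8 = pass_at_k(16, 4, 8)
```

With k = 1 the estimator reduces to the fraction of correct attempts, c / n. A benchmark score is then the mean of the per-problem estimates.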

Paper Structure

This paper contains 53 sections, 25 equations, 16 figures, 8 tables, 1 algorithm.

Figures (16)

  • Figure 1: Test-time curricula (TTCs) lead to remarkable improvements in math and coding by practicing on self-curated task-related problems at test-time. The plots show the pass@1 test accuracy of Qwen3-8B throughout its test-time training. Our method, TTC-RL (solid red line), consistently improves performance, learning faster and achieving a higher final accuracy than standard RL post-training (dashed gray line). Notably, the final pass@1 accuracy of TTC-RL approaches the model's initial pass@8 performance (dotted gray line), which represents a proxy for the performance ceiling of the initial model. The stars indicate the final pass@8 values after TTC-RL, demonstrating a significant improvement over the initial pass@8, which indicates that the model learns new solution strategies at test-time.
  • Figure 2: TTC-RL performs targeted practice on similar problems to the target task at test-time. The agent is given a target task (red) and self-curates a curriculum of related tasks (blue). It then explores solution strategies on this curriculum, reinforcing successful approaches ($\checkmark$). This experience enables the agent to more effectively solve the original, more difficult target task.
  • Figure 3: TTC-RL substantially outperforms general-purpose RL post-training for a range of data sizes. We evaluate Qwen3-8B on all seven benchmarks and report the average test accuracy when training for 250 steps.
  • Figure 4: TTC-RL scales test-time compute in a way that is complementary to other means of test-time scaling. Left: The pass@$k$ of TTC-RL on Qwen3-8B, averaged over benchmarks, increases substantially for small and large $k$, indicating that TTC-RL raises the model's performance ceiling. Middle: TTC-RL also improves the performance of majority voting (across math and GPQA-D), with the initial pass@1 significantly outperforming maj@64 on the initial model. Right: We evaluate Qwen3-8B in non-thinking and thinking mode, as well as the non-thinking model + TTC-RL. The color indicates the relative accuracy per column. We find that TTC-RL significantly improves the non-thinking model, allowing it to perform close to the thinking variant in several domains, despite reasoning over 8k rather than 30k context tokens.
  • Figure 5: Left: Per-task TTC-RL outperforms a benchmark-level TTC in AIME25. We perform TTC-RL and maj-TTRL (cf. \ref{sec:self_improv}) on Qwen3-8B, and find that per-task TTC-RL even outperforms the benchmark-level TTC. Middle: TTC-RL improves "correctness" of reasoning, not only learning the answer format. We evaluate the difference in accuracy between TTC-RL and the initial Qwen3-8B, averaged over benchmarks. The latent improvement is a lower bound on the accuracy gain that is not due to merely learning the format (cf. \ref{sec:learning_formatting}). Right: TTC-RL yields models that are specialized to their target tasks. We plot the accuracy of Qwen3-8B trained for given target tasks (rows) when evaluated on other benchmarks (columns). We normalize accuracies across all evaluations of a particular benchmark. Notably, the model trained via TTC-RL for the "right" target tasks (i.e., the diagonal) always performs best.
  • ...and 11 more figures
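
The caption of Figure 4 refers to majority voting (maj@64). As a rough illustration of what such a score measures, the sketch below takes the most frequent final answer among k sampled attempts and compares it with a reference answer. The function names and the assumption that answers are already normalized strings are hypothetical and do not come from the paper.

```python
from collections import Counter


def majority_vote(answers):
    """Return the most frequent answer among the sampled attempts.

    Ties are broken in favor of the answer that appears first, which is
    how Counter.most_common orders elements with equal counts.
    """
    if not answers:
        raise ValueError("answers must be non-empty")
    return Counter(answers).most_common(1)[0][0]


def maj_correct(answers, reference):
    """True if the majority answer matches the reference answer."""
    return majority_vote(answers) == reference


# Hypothetical final answers from five attempts at one problem.
samples = ["42", "41", "42", "17", "42"]
voted = majority_vote(samples)
```

Unlike pass@k, which counts a problem as solved if any attempt is correct, majority voting only succeeds when the correct answer is also the most common one.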