TAROT: Test-driven and Capability-adaptive Curriculum Reinforcement Fine-tuning for Code Generation with Large Language Models

Chansung Park, Juyong Jiang, Fan Wang, Sayak Paul, Jiasi Shen, Jing Tang, Jianguo Li

TL;DR

TAROT provides a reproducible method that adaptively tailors curriculum design to a model's capability, thereby consistently improving the functional correctness and robustness of the generated code.

Abstract

Large Language Models (LLMs) are changing the coding paradigm, a shift known as vibe coding, yet synthesizing algorithmically sophisticated and robust code remains a critical challenge. Incentivizing the deep reasoning capabilities of LLMs is essential to overcoming this hurdle, and Reinforcement Fine-Tuning (RFT) has emerged as a promising strategy for doing so. However, most existing approaches overlook the heterogeneous difficulty and granularity inherent in test cases, leading to an imbalanced distribution of reward signals and, consequently, biased gradient updates during training. To address this, we propose Test-driven and cApability-adaptive cuRriculum reinfOrcement fine-Tuning (TAROT). TAROT systematically constructs, for each problem, a four-tier test suite (basic, intermediate, complex, edge), providing a controlled difficulty landscape for curriculum design and evaluation. Crucially, TAROT decouples curriculum progression from raw reward scores, enabling capability-conditioned evaluation and principled selection from a portfolio of curriculum policies, rather than relying on the incidental difficulty composition of the test cases. This design fosters stable optimization and more efficient competency acquisition. Extensive experimental results reveal that the optimal curriculum for RFT in code generation is closely tied to a model's inherent capability: less capable models gain more from an easy-to-hard progression, whereas more capable models excel under a hard-first curriculum. TAROT provides a reproducible method that adaptively tailors curriculum design to a model's capability, thereby consistently improving the functional correctness and robustness of the generated code. All code and data are released to foster reproducibility and advance community research at https://github.com/deep-diver/TAROT.
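
To make the idea of a tiered test suite with a reward-decoupled curriculum more concrete, the following is a minimal sketch, not the released TAROT implementation. The names `CURRICULA`, `active_tier`, and `tier_reward` are hypothetical, the schedule advances by training step only (one possible way to decouple progression from reward), and the edge tier is left out of the orderings because the abstract does not say where it falls.

```python
from typing import Any, Callable, Dict, Sequence, Tuple

# Hypothetical portfolio of curriculum policies over three of the four tiers.
# Where the "edge" tier is scheduled is not stated in the abstract, so it is omitted.
CURRICULA: Dict[str, Tuple[str, ...]] = {
    "easy_to_hard": ("basic", "intermediate", "complex"),
    "hard_first": ("complex", "intermediate", "basic"),
}

TestCase = Tuple[Tuple[Any, ...], Any]  # (positional arguments, expected output)


def active_tier(curriculum: str, step: int, total_steps: int) -> str:
    """Pick the tier for a training step from the step index alone.

    Assumption: equal-length phases per tier. The observed reward plays no
    role here, so progression does not depend on raw reward scores.
    """
    if not 0 <= step < total_steps:
        raise ValueError("step must satisfy 0 <= step < total_steps")
    order = CURRICULA[curriculum]
    return order[step * len(order) // total_steps]


def tier_reward(candidate: Callable[..., Any], tests: Sequence[TestCase]) -> float:
    """Fraction of test cases in a single tier that the candidate passes."""
    if not tests:
        return 0.0
    passed = 0
    for args, expected in tests:
        try:
            if candidate(*args) == expected:
                passed += 1
        except Exception:
            # A crashing candidate simply fails that test case.
            pass
    return passed / len(tests)
```

Under these assumptions, selecting a policy for a given model amounts to choosing a key of `CURRICULA`, for example `"easy_to_hard"` for a less capable model and `"hard_first"` for a more capable one, in line with the trend the abstract reports.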

Paper Structure (33 sections, 5 equations, 8 figures, 18 tables)

Figures (8)

  • Figure 1: Overview of the TAROT framework. (top) Build a four-tier test suite (basic/intermediate/complex/edge) per problem using frontier LLMs and verify the tests against the reference solution. (bottom) Reinforcement fine-tuning under a capability-conditioned, reward-decoupled curriculum. Less capable models perform best with basic $\rightarrow$ complex, whereas more capable models perform best with complex $\rightarrow$ basic.
  • Figure 2: Quantitative and qualitative validation of the TAROT dataset. The KDE plots show the distribution of structural complexity, where the x-axis represents the metric's magnitude. Token Diversity (unique/total tokens) and Transitions (character class changes) serve as proxies for lexical and syntactic complexity, respectively. The systematic rightward shift confirms increasing difficulty across tiers. GPT-4o validation on the right confirms that complex tiers target algorithmic complexity, while edge tiers focus on boundary conditions.
  • Figure 3: Experimental results for Qwen2.5-Instruct and Qwen2.5-Coder-Instruct on HumanEval, HumanEval+, MBPP, and MBPP+. Scores are pass@1. Numbers above bars indicate gains in percentage points relative to each model’s base checkpoint. Labels inside bars indicate the best-performing curriculum strategy.
  • Figure 4: Experimental results for Qwen2.5-Instruct and Qwen2.5-Coder-Instruct on CodeForces, LiveCodeBench v5 (LCBv5), and CruxEval. Scores are the overall accuracy across easy, medium, and hard problems. Numbers above bars indicate gains in percentage points relative to each model’s base checkpoint. Labels inside bars indicate the best-performing curriculum strategy.
  • Figure 5: Performance sensitivity to the GRPO hyperparameter $\beta$. The plots show the final pass@1 or accuracy scores on various benchmarks as $\beta$ is varied. The optimal value is task-dependent; for instance, HumanEval and HumanEval+ benefit from a smaller $\beta$ (0.01) that allows greater policy exploration, whereas MBPP and CodeForces achieve peak performance with a larger $\beta$ (0.05) that enforces stronger regularization.
  • ...and 3 more figures
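
The two structural proxies named in the Figure 2 caption can be sketched as follows. This is an illustrative reading of the caption, not the authors' code: whitespace tokenization and the four character classes (digit, letter, whitespace, other) are assumptions, and the function names are hypothetical.

```python
def token_diversity(text: str) -> float:
    """Unique tokens divided by total tokens (assumes whitespace tokenization)."""
    tokens = text.split()
    return len(set(tokens)) / len(tokens) if tokens else 0.0


def _char_class(ch: str) -> str:
    """Assumed character classes: digit, alpha, space, other."""
    if ch.isdigit():
        return "digit"
    if ch.isalpha():
        return "alpha"
    if ch.isspace():
        return "space"
    return "other"


def transitions(text: str) -> int:
    """Count adjacent character pairs whose character class differs."""
    classes = [_char_class(ch) for ch in text]
    return sum(a != b for a, b in zip(classes, classes[1:]))
```

For instance, the test input `"1 2 2 3"` has three unique tokens out of four, giving a token diversity of 0.75, and the string `"a1 b"` changes character class at every adjacent pair, giving three transitions.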