Scheduling Your LLM Reinforcement Learning with Reasoning Trees

Hong Wang, Zhezheng Hao, Jian Luo, Chenxing Wei, Yao Shu, Lei Liu, Qiang Lin, Hande Dong, Jiawei Chen

TL;DR

This work reframes RLVR for LLMs through the lens of reasoning trees, proposing a novel learning-efficiency metric, the Reasoning Score (r-score), to quantify how easily a query can improve under a limited node-editing budget. Building on this, the authors introduce Re-Schedule, a dynamic curriculum that prioritizes structurally simple queries early in training and gradually shifts to more complex ones, using an offline approximation of each query's reasoning tree. Empirical results on six math-reasoning benchmarks show that Re-Schedule consistently outperforms baselines, achieving up to 3.2% higher average accuracy and establishing that tree-structure information is a more powerful predictor of learnability than final-path accuracy. The work provides a principled, scalable approach to RLVR data scheduling with practical implications for training efficient, capable LLMs on reasoning tasks.

Abstract

Using Reinforcement Learning with Verifiable Rewards (RLVR) to optimize Large Language Models (LLMs) can be conceptualized as progressively editing a query's 'Reasoning Tree'. This process involves exploring nodes (tokens) and dynamically modifying the model's policy at each node. When combined with data scheduling, this process yields further gains in data efficiency and accuracy. However, existing RLVR data scheduling methods typically rely on path-based metrics to rank queries, overlooking the reasoning tree structures of these queries. In this paper, we introduce a novel metric, namely Reasoning Score (r-score), which measures the query's learning difficulty based on the structure of its reasoning tree. Based on the r-score, we propose the Reasoning Tree Schedule (Re-Schedule), a scheduling algorithm that constructs a curriculum progressing from structurally simple (high r-score) to complex (low r-score) queries. Experiments on six math-reasoning benchmarks show that Re-Schedule significantly improves average accuracy, achieving gains of up to 3.2%. These strong results validate our approach and demonstrate that a structural understanding of the reasoning tree provides a more powerful and principled foundation for RLVR data scheduling.
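
The curriculum direction described above (high r-score first, low r-score later) can be illustrated with a minimal sketch. The weighting rule below is an assumption made purely for illustration and is not the paper's formula: the function name `curriculum_weights`, the `progress` parameter, and the linear interpolation are all hypothetical, and the r-scores are taken as precomputed inputs.

```python
def curriculum_weights(r_scores, progress):
    """Return normalized per-query sampling weights.

    r_scores: precomputed r-scores, one per query (higher = structurally simpler).
    progress: fraction of training completed, in [0, 1].
    NOTE: this linear interpolation is a hypothetical stand-in, not the
    weighting rule used by Re-Schedule.
    """
    lo, hi = min(r_scores), max(r_scores)
    span = hi - lo
    if span == 0:
        # All queries look equally simple: fall back to uniform weights.
        return [1.0 / len(r_scores)] * len(r_scores)

    raw = []
    for r in r_scores:
        simplicity = (r - lo) / span  # 1.0 for the simplest query, 0.0 for the most complex
        # Early in training favor simple queries, late in training favor complex ones.
        raw.append((1.0 - progress) * simplicity + progress * (1.0 - simplicity))

    total = sum(raw)
    return [w / total for w in raw]


# Three queries with r-scores 1, 2, 3 at the start and end of training.
start = curriculum_weights([1.0, 2.0, 3.0], progress=0.0)
end = curriculum_weights([1.0, 2.0, 3.0], progress=1.0)
```

Under this assumed rule, the query with the highest r-score receives the largest weight at the start of training and the smallest at the end, which mirrors the simple-to-complex ordering without claiming to reproduce the authors' schedule.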

Paper Structure

This paper contains 23 sections, 10 equations, 4 figures, 5 tables.

Figures (4)

  • Figure 1: (a) A simple reasoning tree (q1) requires less node editing for performance improvement than a complex one (q2). (b) Consequently, q1 shows high training efficiency (steep learning curve) despite low initial accuracy, while q2's complex structure leads to low efficiency. (c) Our method leverages this structural insight to significantly outperform baselines on various datasets.
  • Figure 2: Accuracy Progression During Training. The solid line represents the average accuracy, and the shaded area indicates the range.
  • Figure 3: Overview of the Reasoning Tree Schedule (Re-Schedule) Algorithm. (a) Tree Construction: For each query, an approximate reasoning tree is constructed by sampling multiple solution paths from a base model. (b) R-Score Calculation: The tree's structure is analyzed to compute the r-score, a metric quantifying the query's learning potential. (c) Dynamic Weighting: The r-scores are used to dynamically weight each query during training, forming a curriculum that progresses from structurally simple (easy) to complex (hard) examples.
  • Figure 4: (a) The average MCN decreases over time, indicating successful tree optimization. (b) & (c) To compare metrics, we train models on the top 1/3 of the data selected by each metric. The plots show the resulting (b) training accuracy and (c) test accuracy. The model used is Qwen2.5-Math-7B.
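
The tree-construction step in Figure 3(a) can be sketched as merging sampled solution paths into a prefix tree over tokens. This is a minimal illustration under stated assumptions rather than the authors' implementation: the `Node` class, the `build_tree` and `count_nodes` helpers, and the per-node visit and correctness counters are hypothetical names chosen here, and the r-score computation itself is not reproduced because its definition is not given in this summary.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One node (token) of an approximate reasoning tree. Hypothetical structure."""
    children: dict = field(default_factory=dict)  # token -> child Node
    visits: int = 0   # number of sampled paths passing through this node
    correct: int = 0  # how many of those paths ended in a verified-correct answer


def build_tree(paths):
    """Merge sampled paths into a prefix tree.

    paths: iterable of (tokens, is_correct) pairs sampled for a single query,
    where tokens is a sequence of hashable tokens and is_correct is the
    verifier's verdict on the final answer.
    """
    root = Node()
    for tokens, is_correct in paths:
        node = root
        node.visits += 1
        node.correct += int(is_correct)
        for tok in tokens:
            # Paths sharing a prefix reuse the same nodes; they branch where they differ.
            node = node.children.setdefault(tok, Node())
            node.visits += 1
            node.correct += int(is_correct)
    return root


def count_nodes(node):
    """Total number of nodes in the subtree rooted at node, including node itself."""
    return 1 + sum(count_nodes(child) for child in node.children.values())


# Three toy sampled paths for one query; tokens are placeholder strings.
samples = [
    (["a", "b", "c"], True),
    (["a", "b", "d"], False),
    (["a", "e"], True),
]
tree = build_tree(samples)
```

Keeping visit and correctness counts on every node is one possible design choice: it makes structural statistics (branching, per-node accuracy) available to whatever scoring function is applied afterwards.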