VERIFY-RL: Verifiable Recursive Decomposition for Reinforcement Learning in Mathematical Reasoning

Kaleem Ullah Qasim; Jiashu Zhang; Hao Li; Muhammad Kafeel Shaheen

VERIFY-RL: Verifiable Recursive Decomposition for Reinforcement Learning in Mathematical Reasoning

Kaleem Ullah Qasim, Jiashu Zhang, Hao Li, Muhammad Kafeel Shaheen

TL;DR

Verify-RL tackles the lack of guarantees in recursive decomposition for mathematical reasoning by enforcing three verifiable properties $V1$–$V3$ derived from symbolic differentiation. It formalizes a problem space with nesting depth $\delta$ and defines verifiable decomposition via chain, product, and sum rules, with verification automated through symbolic computation to achieve complete verification by construction. Curriculum construction guarantees prerequisite ordering through topological sorting of verified decomposition trees, and policy optimization uses Group Relative Policy Optimization (GRPO) with a rule-based reward. Across models spanning $0.6$B–$3$B parameters, Verify-RL yields substantial gains, notably boosting hardest-level accuracy from $32\%$ to $68\%$ and achieving up to $+112\%$ relative improvement, with ablations confirming the critical role of $V1$–$V3$ and the verified curriculum. The framework improves sample efficiency and reliability of mathematical reasoning in RL for LLMs, suggesting a path toward robust, verifiable reasoning in domains with formal rule structures.

Abstract

Training language models to solve complex mathematical problems benefits from curriculum learning progressively training on simpler subproblems. However, existing decomposition methods are often heuristic, offering no guarantees that subproblems are simpler, that solving them aids the parent task, or that their relationships are mathematically grounded. We observe that symbolic differentiation provides a natural structure for verified decomposition: calculus rules explicitly define how expressions reduce to simpler components with provable properties. We introduce Verify-RL, a framework where every parent-child decomposition satisfies three verifiable conditions: strictly decreasing structural complexity, solution containment, and formal rule derivation. Unlike heuristic methods where a significant fraction of decompositions are invalid our properties admit automatic verification through symbolic computation, achieving "verification by construction" Experiments demonstrate that eliminating invalid decompositions yields sizable gains, accuracy on the hardest problems more than doubles from 32% to 68%, with a 40% relative improvement overall.

VERIFY-RL: Verifiable Recursive Decomposition for Reinforcement Learning in Mathematical Reasoning

TL;DR

Verify-RL tackles the lack of guarantees in recursive decomposition for mathematical reasoning by enforcing three verifiable properties

–

derived from symbolic differentiation. It formalizes a problem space with nesting depth

and defines verifiable decomposition via chain, product, and sum rules, with verification automated through symbolic computation to achieve complete verification by construction. Curriculum construction guarantees prerequisite ordering through topological sorting of verified decomposition trees, and policy optimization uses Group Relative Policy Optimization (GRPO) with a rule-based reward. Across models spanning

B–

B parameters, Verify-RL yields substantial gains, notably boosting hardest-level accuracy from

and achieving up to

relative improvement, with ablations confirming the critical role of

–

and the verified curriculum. The framework improves sample efficiency and reliability of mathematical reasoning in RL for LLMs, suggesting a path toward robust, verifiable reasoning in domains with formal rule structures.

Abstract

Paper Structure (25 sections, 4 theorems, 13 equations, 6 figures, 10 tables)

This paper contains 25 sections, 4 theorems, 13 equations, 6 figures, 10 tables.

Introduction
Related Work
Methods
Problem Space and Complexity Measure
Verifiable Decomposition Theory
Verification Procedures
Training with Verified Curriculum
Curriculum Construction
Policy Optimization
Reward Function
Experiments
Experimental Setup
Dataset
Training Configuration
Evaluation Metrics
...and 10 more sections

Key Result

Theorem 1

For $p = f(g(x))$ with $g(x) \neq x$ and $c = g(x)$: $\delta(c) < \delta(p)$, $\sigma(c) \preceq \sigma(p)$, and $c \sqsubseteq p$.

Figures (6)

Figure 1: Verify-RL framework. Problems pass through rule-based decomposition, verification (V1– V3), and curriculum ordering before RL training. Failed verifications trigger re-decomposition.
Figure 2: Pseudocode for verified decomposition (left) and curriculum construction (right). Verify checks all three properties V1– V3 before accepting a child problem.
Figure 3: Accuracy across difficulty levels D1– D5 for four models (rows) trained with three RL algorithms (columns). Each cell contains five circles representing accuracy at D1– D5, with circle darkness indicating performance. Baseline (grey) shows untrained accuracy, while Verifiable-GRPO (red), Verifiable-PPO (blue), and Verifiable-DPO (green) show post-training results. Baseline accuracy degrades sharply from D1 to D5 (e.g., Qwen3-1.5B: 88% to 32%), while trained models maintain higher accuracy throughout. GRPO achieves the best results across all models, with D5 improvements ranging from +45% (Qwen3-3B) to +112% (Qwen3-1.5B). Statistics below each cell show decay rate, average accuracy, and range.
Figure 4: Component ablation on Qwen3-1.5B. Curriculum order, reward complexity, and decomposition depth contributions. Value $\Delta$ shows change from full Verify-RL (GRPO).
Figure 5: Verification property satisfaction rates across decomposition methods. Verify-RL achieves 100% on V1--V3 through calculus rule derivation. LADDER (LLM-generated): V1 94.2%, V2 88.7%, V3 91.3%, with 78.4% joint verification. Random easier problems: V1 100%, V2 12.3%, V3 0%.
...and 1 more figures

Theorems & Definitions (15)

Definition 1: Expression Tree
Definition 2: Nesting Depth
Definition 3: Difficulty Partition
Definition 4: Structural Containment
Definition 5: Solution Containment
Definition 6: Verifiable Decomposition
Theorem 1: Chain Rule Satisfies V1– V3
proof
Theorem 2: Product Rule Satisfies V1– V3
proof
...and 5 more

VERIFY-RL: Verifiable Recursive Decomposition for Reinforcement Learning in Mathematical Reasoning

TL;DR

Abstract

VERIFY-RL: Verifiable Recursive Decomposition for Reinforcement Learning in Mathematical Reasoning

Authors

TL;DR

Abstract

Table of Contents

Key Result

Figures (6)

Theorems & Definitions (15)