Table of Contents
Fetching ...

VERIFY-RL: Verifiable Recursive Decomposition for Reinforcement Learning in Mathematical Reasoning

Kaleem Ullah Qasim, Jiashu Zhang, Hao Li, Muhammad Kafeel Shaheen

TL;DR

Verify-RL tackles the lack of guarantees in recursive decomposition for mathematical reasoning by enforcing three verifiable properties $V1$–$V3$ derived from symbolic differentiation. It formalizes a problem space with nesting depth $\delta$ and defines verifiable decomposition via chain, product, and sum rules, with verification automated through symbolic computation to achieve complete verification by construction. Curriculum construction guarantees prerequisite ordering through topological sorting of verified decomposition trees, and policy optimization uses Group Relative Policy Optimization (GRPO) with a rule-based reward. Across models spanning $0.6$B–$3$B parameters, Verify-RL yields substantial gains, notably boosting hardest-level accuracy from $32\%$ to $68\%$ and achieving up to $+112\%$ relative improvement, with ablations confirming the critical role of $V1$–$V3$ and the verified curriculum. The framework improves sample efficiency and reliability of mathematical reasoning in RL for LLMs, suggesting a path toward robust, verifiable reasoning in domains with formal rule structures.

Abstract

Training language models to solve complex mathematical problems benefits from curriculum learning progressively training on simpler subproblems. However, existing decomposition methods are often heuristic, offering no guarantees that subproblems are simpler, that solving them aids the parent task, or that their relationships are mathematically grounded. We observe that symbolic differentiation provides a natural structure for verified decomposition: calculus rules explicitly define how expressions reduce to simpler components with provable properties. We introduce Verify-RL, a framework where every parent-child decomposition satisfies three verifiable conditions: strictly decreasing structural complexity, solution containment, and formal rule derivation. Unlike heuristic methods where a significant fraction of decompositions are invalid our properties admit automatic verification through symbolic computation, achieving "verification by construction" Experiments demonstrate that eliminating invalid decompositions yields sizable gains, accuracy on the hardest problems more than doubles from 32% to 68%, with a 40% relative improvement overall.

VERIFY-RL: Verifiable Recursive Decomposition for Reinforcement Learning in Mathematical Reasoning

TL;DR

Verify-RL tackles the lack of guarantees in recursive decomposition for mathematical reasoning by enforcing three verifiable properties derived from symbolic differentiation. It formalizes a problem space with nesting depth and defines verifiable decomposition via chain, product, and sum rules, with verification automated through symbolic computation to achieve complete verification by construction. Curriculum construction guarantees prerequisite ordering through topological sorting of verified decomposition trees, and policy optimization uses Group Relative Policy Optimization (GRPO) with a rule-based reward. Across models spanning B–B parameters, Verify-RL yields substantial gains, notably boosting hardest-level accuracy from to and achieving up to relative improvement, with ablations confirming the critical role of and the verified curriculum. The framework improves sample efficiency and reliability of mathematical reasoning in RL for LLMs, suggesting a path toward robust, verifiable reasoning in domains with formal rule structures.

Abstract

Training language models to solve complex mathematical problems benefits from curriculum learning progressively training on simpler subproblems. However, existing decomposition methods are often heuristic, offering no guarantees that subproblems are simpler, that solving them aids the parent task, or that their relationships are mathematically grounded. We observe that symbolic differentiation provides a natural structure for verified decomposition: calculus rules explicitly define how expressions reduce to simpler components with provable properties. We introduce Verify-RL, a framework where every parent-child decomposition satisfies three verifiable conditions: strictly decreasing structural complexity, solution containment, and formal rule derivation. Unlike heuristic methods where a significant fraction of decompositions are invalid our properties admit automatic verification through symbolic computation, achieving "verification by construction" Experiments demonstrate that eliminating invalid decompositions yields sizable gains, accuracy on the hardest problems more than doubles from 32% to 68%, with a 40% relative improvement overall.
Paper Structure (25 sections, 4 theorems, 13 equations, 6 figures, 10 tables)

This paper contains 25 sections, 4 theorems, 13 equations, 6 figures, 10 tables.

Key Result

Theorem 1

For $p = f(g(x))$ with $g(x) \neq x$ and $c = g(x)$: $\delta(c) < \delta(p)$, $\sigma(c) \preceq \sigma(p)$, and $c \sqsubseteq p$.

Figures (6)

  • Figure 1: Verify-RL framework. Problems pass through rule-based decomposition, verification (V1– V3), and curriculum ordering before RL training. Failed verifications trigger re-decomposition.
  • Figure 2: Pseudocode for verified decomposition (left) and curriculum construction (right). Verify checks all three properties V1– V3 before accepting a child problem.
  • Figure 3: Accuracy across difficulty levels D1– D5 for four models (rows) trained with three RL algorithms (columns). Each cell contains five circles representing accuracy at D1– D5, with circle darkness indicating performance. Baseline (grey) shows untrained accuracy, while Verifiable-GRPO (red), Verifiable-PPO (blue), and Verifiable-DPO (green) show post-training results. Baseline accuracy degrades sharply from D1 to D5 (e.g., Qwen3-1.5B: 88% to 32%), while trained models maintain higher accuracy throughout. GRPO achieves the best results across all models, with D5 improvements ranging from +45% (Qwen3-3B) to +112% (Qwen3-1.5B). Statistics below each cell show decay rate, average accuracy, and range.
  • Figure 4: Component ablation on Qwen3-1.5B. Curriculum order, reward complexity, and decomposition depth contributions. Value $\Delta$ shows change from full Verify-RL (GRPO).
  • Figure 5: Verification property satisfaction rates across decomposition methods. Verify-RL achieves 100% on V1--V3 through calculus rule derivation. LADDER (LLM-generated): V1 94.2%, V2 88.7%, V3 91.3%, with 78.4% joint verification. Random easier problems: V1 100%, V2 12.3%, V3 0%.
  • ...and 1 more figures

Theorems & Definitions (15)

  • Definition 1: Expression Tree
  • Definition 2: Nesting Depth
  • Definition 3: Difficulty Partition
  • Definition 4: Structural Containment
  • Definition 5: Solution Containment
  • Definition 6: Verifiable Decomposition
  • Theorem 1: Chain Rule Satisfies V1– V3
  • proof
  • Theorem 2: Product Rule Satisfies V1– V3
  • proof
  • ...and 5 more