VERIFY-RL: Verifiable Recursive Decomposition for Reinforcement Learning in Mathematical Reasoning
Kaleem Ullah Qasim, Jiashu Zhang, Hao Li, Muhammad Kafeel Shaheen
TL;DR
Verify-RL tackles the lack of guarantees in recursive decomposition for mathematical reasoning by enforcing three verifiable properties $V1$–$V3$ derived from symbolic differentiation. It formalizes a problem space with nesting depth $\delta$ and defines verifiable decomposition via chain, product, and sum rules, with verification automated through symbolic computation to achieve complete verification by construction. Curriculum construction guarantees prerequisite ordering through topological sorting of verified decomposition trees, and policy optimization uses Group Relative Policy Optimization (GRPO) with a rule-based reward. Across models spanning $0.6$B–$3$B parameters, Verify-RL yields substantial gains, notably boosting hardest-level accuracy from $32\%$ to $68\%$ and achieving up to $+112\%$ relative improvement, with ablations confirming the critical role of $V1$–$V3$ and the verified curriculum. The framework improves sample efficiency and reliability of mathematical reasoning in RL for LLMs, suggesting a path toward robust, verifiable reasoning in domains with formal rule structures.
Abstract
Training language models to solve complex mathematical problems benefits from curriculum learning progressively training on simpler subproblems. However, existing decomposition methods are often heuristic, offering no guarantees that subproblems are simpler, that solving them aids the parent task, or that their relationships are mathematically grounded. We observe that symbolic differentiation provides a natural structure for verified decomposition: calculus rules explicitly define how expressions reduce to simpler components with provable properties. We introduce Verify-RL, a framework where every parent-child decomposition satisfies three verifiable conditions: strictly decreasing structural complexity, solution containment, and formal rule derivation. Unlike heuristic methods where a significant fraction of decompositions are invalid our properties admit automatic verification through symbolic computation, achieving "verification by construction" Experiments demonstrate that eliminating invalid decompositions yields sizable gains, accuracy on the hardest problems more than doubles from 32% to 68%, with a 40% relative improvement overall.
