Composition-RL: Compose Your Verifiable Prompts for Reinforcement Learning of Large Language Models

Xin Xu, Clive Bai, Kai Yang, Tianhao Chen, Yangkun Chen, Weijie Liu, Hao Chen, Yang Wang, Saiyong Yang, Can Yang

TL;DR

Composition-RL introduces Sequential Prompt Composition (SPC), which automatically synthesizes compositional prompts from existing verifiable prompts, and uses those prompts for reinforcement learning with verifiable rewards (RLVR). A curriculum variant progressively increases compositional depth during training, which further improves reasoning, especially in larger models. Composing prompts drawn from different domains also yields cross-domain benefits. Experiments on 4B–30B models show consistent gains on math benchmarks (e.g., the AIME family) and on out-of-domain multi-task reasoning. The work points to compositional generalization and implicit process supervision as the mechanisms behind these gains and suggests composed prompts as a promising direction for general-domain RL.
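To make the idea of sequential composition and compositional depth concrete, here is a minimal toy sketch. It is not the authors' implementation: the paper's figure captions indicate that an LLM is prompted to generate the variable and its definition, whereas this sketch assumes each problem is a plain template with a single numeric slot. The names `Problem`, `compose_sequence`, and `curriculum_depth`, the offset-based linking rule, and the step-based schedule are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Problem:
    """Toy verifiable problem (hypothetical format, not from the paper)."""
    template: str  # question text with one "{x}" slot for a numeric constant
    constant: int  # value that originally fills the slot
    answer: int    # checkable ground-truth answer

    def text(self) -> str:
        return self.template.format(x=self.constant)


def compose_sequence(problems: list[Problem]) -> tuple[str, int]:
    """Chain problems so each one depends on the previous final answer.

    The constant of problem k+1 is rewritten as an expression in v_k, the
    answer to problem k, so the composed prompt stays verifiable: its
    ground truth is simply the answer of the last problem.
    """
    if not problems:
        raise ValueError("need at least one problem")
    parts = [f"Problem 1: {problems[0].text()}"]
    answer = problems[0].answer
    for k, nxt in enumerate(problems[1:], start=1):
        var = f"v{k}"
        # Assumed linking rule: choose an offset so the expression evaluates
        # to the original constant of the next problem.
        offset = nxt.constant - answer
        expr = f"({var} + {offset})" if offset >= 0 else f"({var} - {-offset})"
        parts.append(f"Let {var} be the final answer to Problem {k}.")
        parts.append(f"Problem {k + 1}: {nxt.template.format(x=expr)}")
        answer = nxt.answer
    parts.append(f"Give only the final answer to Problem {len(problems)}.")
    return "\n".join(parts), answer


def curriculum_depth(step: int, stage_length: int, max_depth: int) -> int:
    """Hypothetical schedule: depth grows by one every `stage_length` steps."""
    return min(1 + step // stage_length, max_depth)


q1 = Problem("What is {x} + 5?", constant=7, answer=12)
q2 = Problem("What is {x} times 3?", constant=4, answer=12)
q3 = Problem("What is {x} squared?", constant=5, answer=25)

prompt, truth = compose_sequence([q1, q2, q3])
print(prompt)
print("ground truth:", truth)
```

In this toy setup a model can only reach the final answer by first solving the earlier problems correctly, which is one way to read the "implicit process supervision" mentioned above. Real prompts would need a more robust substitution mechanism than `str.format`, which breaks on question text that already contains braces.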

Abstract

Large-scale verifiable prompts underpin the success of Reinforcement Learning with Verifiable Rewards (RLVR), but they contain many uninformative examples and are costly to expand further. Recent studies focus on better exploiting limited training data by prioritizing hard prompts whose rollout pass rate is 0. However, easy prompts with a pass rate of 1 also become increasingly prevalent as training progresses, thereby reducing the effective data size. To mitigate this, we propose Composition-RL, a simple yet useful approach that makes better use of limited verifiable prompts by targeting pass-rate-1 prompts. More specifically, Composition-RL automatically composes multiple problems into a new verifiable question and uses these compositional prompts for RL training. Extensive experiments across model sizes from 4B to 30B show that Composition-RL consistently improves reasoning capability over RL trained on the original dataset. Performance can be further boosted with a curriculum variant of Composition-RL that gradually increases compositional depth over training. Additionally, Composition-RL enables more effective cross-domain RL by composing prompts drawn from different domains. Code, datasets, and models are available at https://github.com/XinXU-USTC/Composition-RL.
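The notion of "effective data size" can be illustrated with a short sketch. It assumes binary rewards per rollout and a group-normalized advantage of the kind used in GRPO-style training; the abstract does not name the RL algorithm, so that choice is an assumption, and the function names `group_advantages` and `batch_stats` are hypothetical. Under this assumption, a prompt whose rollouts all receive the same reward (pass rate 0 or 1) yields zero advantage for every rollout and therefore no learning signal.

```python
from statistics import mean, pstdev


def group_advantages(rewards: list[int], eps: float = 1e-6) -> list[float]:
    """Group-normalized advantages (assumed GRPO-style, not from the paper)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def batch_stats(rollout_rewards: dict[str, list[int]]) -> dict[str, float]:
    """Fraction of prompts that are fully solved, never solved, or informative."""
    n = len(rollout_rewards)
    solve_all = sum(all(r) for r in rollout_rewards.values()) / n
    solve_none = sum(not any(r) for r in rollout_rewards.values()) / n
    return {
        "solve_all": solve_all,    # pass rate 1: too easy
        "solve_none": solve_none,  # pass rate 0: too hard
        "effective": 1.0 - solve_all - solve_none,
    }


# Hypothetical batch: 4 prompts, 4 rollouts each, reward 1 = verified correct.
batch = {
    "p1": [1, 1, 1, 1],
    "p2": [0, 0, 0, 0],
    "p3": [1, 0, 1, 0],
    "p4": [1, 1, 1, 1],
}
stats = batch_stats(batch)
print(stats)
print(group_advantages(batch["p1"]))  # all zeros: no gradient signal
print(group_advantages(batch["p3"]))  # mixed signs: informative prompt
```

As the policy improves, the `solve_all` fraction rises and the `effective` fraction shrinks, which is the effect Composition-RL targets by turning pass-rate-1 prompts into harder composed ones.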

Paper Structure (31 sections, 9 equations, 6 figures, 3 tables)


Figures (6)

  • Figure 1: Overview of Composition-RL. Top: an example of composing two math problems, illustrating the high-level idea of Composition-RL. Bottom left: pass@1 (%) on AIME24 versus training steps for different methods, summarizing the key findings (Findings 1 and 2). Bottom right: cross-topic results on the five MMLU-Pro subjects with the largest sample sizes, highlighting the main finding (Finding 3).
  • Figure 2: Visualization of meta-experiments. Left: solve_all ratio curve for RL of Qwen3-4B-Base with original prompts (MATH12K) versus compositional prompts. Right: avg@8 accuracy on a subset of MATH500 and its corresponding compositional test prompts.
  • Figure 3: Left: avg@8 accuracy on a subset of MATH500 and the corresponding compositional test prompts across different model sizes. The darker color and the numbers denote the improvement of our Composition-RL over the RL training on the MATH12K baseline. Right: The fraction of prompts for which $q_{1:2}$ is solved correctly, and the accuracy of recovering $v_1$ at each training step.
  • Figure 4: The Prompt for Verifying the Correctness of Finding $v_1$ in LLMs' Response.
  • Figure 5: The Prompt for Generating Variable $v_1$ and Definition $d_1$ for $q_1$.
  • ...and 1 more figure