CM2: Reinforcement Learning with Checklist Rewards for Multi-Turn and Multi-Step Agentic Tool Use

Zhen Zhang, Kaiqiang Song, Xun Wang, Yebowen Hu, Weixiang Yan, Chenyang Zhao, Henry Peng Zou, Haoyun Deng, Sathish Reddy Indurthi, Shujian Liu, Simin Ma, Xiaoyang Wang, Xin Eric Wang, Song Wang

TL;DR

CM2 addresses the problem of training multi-turn, multi-step tool-use agents when verifiable rewards are scarce: it replaces verifiable outcome rewards with binary, evidence-grounded checklist rewards. Training takes place in an LLM-simulated tool environment with thousands of tools, pairing sparse reward assignment with dense evaluation criteria, and yields consistent improvements over supervised fine-tuning across multiple benchmarks. The results match or exceed open-source baselines of similar size and rival the judging model on several tasks, which suggests the approach generalizes well. The work also outlines ways to scale the approach, such as more checklists, ensemble judging, and larger models, offering a practical path toward large-scale optimization of agentic tool use without heavy environment engineering.

Abstract

AI agents are increasingly used to solve real-world tasks by reasoning over multi-turn user interactions and invoking external tools. However, applying reinforcement learning to such settings remains difficult: realistic objectives often lack verifiable rewards and instead emphasize open-ended behaviors; moreover, RL for multi-turn, multi-step agentic tool use is still underexplored; and building and maintaining executable tool environments is costly, limiting scale and coverage. We propose CM2, an RL framework that replaces verifiable outcome rewards with checklist rewards. CM2 decomposes each turn's intended behavior into fine-grained binary criteria with explicit evidence grounding and structured metadata, turning open-ended judging into more stable classification-style decisions. To balance stability and informativeness, our method adopts a strategy of sparse reward assignment but dense evaluation criteria. Training is performed in a scalable LLM-simulated tool environment, avoiding heavy engineering for large tool sets. Experiments show that CM2 consistently improves over supervised fine-tuning. Starting from an 8B Base model and training on an 8k-example RL dataset, CM2 improves over the SFT counterpart by 8 points on τ²-Bench, by 10 points on BFCL-V4, and by 12 points on ToolSandbox. The results match or even outperform similarly sized open-source baselines, including the judging model. CM2 thus provides a scalable recipe for optimizing multi-turn, multi-step tool-using agents without relying on verifiable rewards. Code provided by the open-source community: https://github.com/namezhenzhang/CM2-RLCR-Tool-Agent.
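
The combination of dense binary criteria and sparse reward assignment described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation. The `ChecklistItem` fields, the rule that a verdict without cited evidence counts as unsatisfied, and the mean-based aggregation are all assumptions made for illustration; the paper's actual reward definition is not stated in this section.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class ChecklistItem:
    """One binary criterion for a turn (all field names are hypothetical)."""
    turn: int        # index of the turn this criterion belongs to
    criterion: str   # the behavior the agent was expected to show
    evidence: str    # trajectory snippet cited by the judge; empty if none
    satisfied: bool  # the judge's binary verdict


def turn_scores(items: List[ChecklistItem]) -> Dict[int, float]:
    """Dense evaluation: fraction of satisfied criteria for each turn."""
    totals: Dict[int, int] = {}
    passed: Dict[int, int] = {}
    for item in items:
        totals[item.turn] = totals.get(item.turn, 0) + 1
        # Assumption: a positive verdict only counts if evidence is cited.
        ok = item.satisfied and bool(item.evidence.strip())
        passed[item.turn] = passed.get(item.turn, 0) + int(ok)
    return {t: passed[t] / totals[t] for t in totals}


def trajectory_reward(items: List[ChecklistItem]) -> float:
    """Sparse assignment: a single scalar for the whole trajectory,
    computed here as the mean of the per-turn scores (an assumption)."""
    scores = turn_scores(items)
    return sum(scores.values()) / len(scores) if scores else 0.0


items = [
    ChecklistItem(0, "Searches flights for the requested date",
                  "tool_call: search_flights(date='2025-01-05')", True),
    ChecklistItem(0, "Asks for confirmation before booking", "", True),
    ChecklistItem(1, "Reports the booking reference to the user",
                  "assistant: Your booking reference is ABC123.", True),
]

print(turn_scores(items))        # {0: 0.5, 1: 1.0}
print(trajectory_reward(items))  # 0.75
```

In this sketch every criterion is judged individually, but only one number reaches the policy update. Assigning rewards at a finer granularity (for example, per turn) would amount to returning `turn_scores` instead of their mean.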

Paper Structure (42 sections, 11 equations, 3 figures, 4 tables)


Figures (3)

  • Figure 1: Overview of CM2. Starting from multi-turn, multi-step tool-use trajectories, we perform data filtering, CoT compression, and cold-start SFT, then annotate a per-turn checklist with evidence-grounded binary criteria and structured metadata. RL training is carried out in an LLM-simulated tool environment, where an LLM simulator produces tool responses and an LLM-as-a-Judge evaluates checklist items to compute rewards. The bottom panel contrasts dense criteria granularity with sparse reward assignment at different assignment granularities.
  • Figure 2: Example of One Checklist Item
  • Figure 3: Comparison results under different settings.