RLFactory: A Plug-and-Play Reinforcement Learning Post-Training Framework for LLM Multi-Turn Tool-Use

Jiajun Chai, Guojun Yin, Zekun Xu, Chuhuai Yue, Yi Jia, Siyu Xia, Xiaohan Wang, Jiwen Jiang, Xiaoguang Li, Chengqi Dong, Hang He, Wei Lin

TL;DR

RLFactory is a plug-and-play reinforcement learning post-training framework designed to enhance multi-turn tool use by LLMs. It reconstructs the decision process by incorporating observation tokens from tool feedback into the MDP and employs a generate-parse-invoke-update loop with asynchronous tool calls to enable robust, scalable learning. The framework supports diverse reward signals (rule-based, model-judgment, and tool-verification) and decouples the tool environment from training, enabling rapid integration of heterogeneous tools through MCP configurations. On Search-R1 with Qwen3-4B, it reaches a 0.486 test score on Natural Questions, ahead of the larger Qwen2.5-7B-Instruct-GRPO baseline at 0.473, while raising training throughput by 6.8x.

Abstract

Large language models excel at basic reasoning but struggle with tasks that require interaction with external tools. We present RLFactory, a plug-and-play reinforcement learning post-training framework for multi-round tool use. RLFactory tackles (i) tool-call stability and adaptability amid tool heterogeneity and interface issues via an asyncio-based asynchronous caller and a decoupled tool/training architecture, and (ii) diverse evaluation needs via a reward layer supporting rule-based, model-judgment, and tool-verification signals. It reconstructs the MDP by introducing observation markers from tool feedback, closing the loop among model, tools, and environment, and implements a generate-parse-invoke-update workflow for dynamic policy optimization. On Search-R1 with Qwen3-4B, RLFactory achieves a 0.486 test score on the Natural Questions (NQ) dataset, surpassing larger models trained with similar techniques (e.g., Qwen2.5-7B-Instruct-GRPO at 0.473), and increases training throughput by 6.8x. RLFactory provides a low-barrier, highly adaptable framework for strengthening multi-round tool use of LLMs in real-world scenarios. Code: https://github.com/Simple-Efficient/RL-Factory.
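
The generate-parse-invoke-update workflow described in the abstract can be illustrated with a minimal sketch. It is not RLFactory's actual code: the `<tool_call>`, `<observation>`, and `<answer>` tags, the `search` tool, the stub `generate` policy, and all function names are assumptions made for illustration. The only elements taken from the abstract are the four-step loop, the use of asyncio for concurrent tool calls, the appending of tool feedback to the context as observations, and the idea of a rule-based reward.

```python
import asyncio
import json
import re

# Hypothetical tag format; the framework's real markers may differ.
TOOL_CALL_RE = re.compile(r"<tool_call>(.*?)</tool_call>", re.DOTALL)
ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)


async def search(query: str) -> str:
    """Stand-in tool; a real one would query a search service."""
    await asyncio.sleep(0.01)  # simulate I/O latency
    return f"results for {query!r}"


# Hypothetical tool registry, kept separate from the rollout logic.
TOOLS = {"search": search}


def generate(context: str) -> str:
    """Stub policy: request one search, then answer once an observation exists."""
    if "<observation>" not in context:
        call = {"name": "search", "arguments": {"query": "capital of France"}}
        return f"<tool_call>{json.dumps(call)}</tool_call>"
    return "<answer>Paris</answer>"


def parse(text: str) -> list:
    """Extract JSON tool calls from the model output."""
    return [json.loads(body) for body in TOOL_CALL_RE.findall(text)]


async def invoke(calls: list) -> list:
    """Run all tool calls of one turn concurrently."""
    tasks = [TOOLS[c["name"]](**c["arguments"]) for c in calls]
    return await asyncio.gather(*tasks)


async def rollout(prompt: str, max_turns: int = 4) -> str:
    """Generate, parse, invoke, and update the context until no tool is called."""
    context = prompt
    for _ in range(max_turns):
        output = generate(context)          # generate
        context += output
        calls = parse(output)               # parse
        if not calls:
            break                           # final answer reached
        observations = await invoke(calls)  # invoke
        for obs in observations:            # update: append observations
            context += f"<observation>{obs}</observation>"
    return context


def rule_based_reward(trajectory: str, gold: str) -> float:
    """Exact-match reward on the final answer (one possible rule-based signal)."""
    match = ANSWER_RE.search(trajectory)
    if match and match.group(1).strip().lower() == gold.strip().lower():
        return 1.0
    return 0.0
```

Calling `asyncio.run(rollout("Q: What is the capital of France?\n"))` yields a trajectory containing one tool call, one observation, and a final answer, which `rule_based_reward` can then score. Because the tools live in a registry outside `rollout`, swapping in a different tool does not touch the loop, which mirrors in miniature the decoupling of tool environment and training that the abstract describes.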

Paper Structure

This paper contains 11 sections, 4 equations, 5 figures, 1 table.

Figures (5)

  • Figure 1: The "model-tool" collaborative paradigm.
  • Figure 2: The framework of RLFactory.
  • Figure 3: A diagram of the RLFactory layer structure. The basic layer can be used directly without any code modifications. The component layer has a tool manager implemented, and users can also design their tool managers. The application layer provides example search and image environments, allowing users to customize their environments based on specific problems.
  • Figure 4: RLFactory multi-round tool call logic diagram using GRPO as an example.
  • Figure 5: Mean reward score trends across different base models.