ToolBrain: A Flexible Reinforcement Learning Framework for Agentic Tools

Quy Minh Le, Minh Sao Khue Luu, Khanh-Tung Tran, Duc-Hai Nguyen, Hoang-Quoc-Viet Pham, Quan Le, Hoang Thanh Lam, Hoang D. Nguyen

TL;DR

ToolBrain tackles the challenge of enabling agentic AI to use tools by delivering a flexible reinforcement learning framework that supports multiple learning strategies (GRPO and DPO) and a hybrid reward system combining user-defined signals with LLM-based judgments. Its Coach–Athlete paradigm separates high-level orchestration from task execution, aided by an Adapter that produces rich execution traces for RL feedback. The framework integrates features such as ToolRetriever for intelligent tool selection, zero-learning data generation, knowledge distillation, and efficient fine-tuning via Unsloth/QLoRA/BitsAndBytes, all demonstrated on an Email Search task with substantial gains over baselines. The results show faster convergence and robust tool-use improvements, highlighting ToolBrain’s potential to lower the barrier to deploying domain-adapted, tool-using agents in resource-constrained settings. Overall, ToolBrain offers a practical, extensible path for researchers and practitioners to rapidly develop and deploy capable agentic systems with configurable rewards and tooling.

Abstract

Effective tool use is essential for agentic AI, yet training agents to use tools remains challenging because of manually designed rewards, limited training data, and poor multi-tool selection, which lead to slow adaptation, wasted computational resources, and suboptimal performance. We introduce ToolBrain, a lightweight and user-friendly framework for coaching tool use in agentic models with flexible reinforcement learning (RL), lowering the barriers for researchers and practitioners to adapt LLM-based agents to specific domains. It supports a wide range of training strategies, including RL algorithms such as GRPO and DPO, as well as supervised learning. ToolBrain lets users define custom reward callables that operate directly on an agent's execution traces, or rely on an automated LLM-as-a-judge system for reward generation. It also provides knowledge distillation from large to small models for efficient development, automatic task generation from tool descriptions, seamless tool retrieval, efficient fine-tuning pipelines with QLoRA through Unsloth, and quantized inference via bitsandbytes. We demonstrate ToolBrain through diverse use cases, such as training a CodeAct agent to autonomously execute email search tasks, showing fast, targeted improvements (up to 30.0%) in tool-use skills while keeping the codebase simple and extensible. Our framework is publicly available at https://toolbrain.org.
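
To make the idea of a reward callable over an execution trace concrete, the following is a minimal sketch and not ToolBrain's actual API. The `Step` dataclass, the `trace_reward` function, and the `group_relative_advantages` helper are hypothetical names introduced only for illustration. The reward rule (0.0 on any error, 1.0 when the final output contains the expected answer) is an assumed toy policy. The second function shows the group-normalized advantage commonly associated with GRPO, where each rollout's reward is centered and scaled by the statistics of its group.

```python
from dataclasses import dataclass
from statistics import mean, pstdev
from typing import List, Optional


@dataclass
class Step:
    """One step of a hypothetical execution trace (not ToolBrain's real type)."""
    code: str                      # code the agent emitted at this step
    output: Optional[str] = None   # tool or interpreter output, if any
    error: Optional[str] = None    # error name, e.g. "SyntaxError", if the step failed


def trace_reward(trace: List[Step], expected: str) -> float:
    """Toy reward: 0.0 if the trace is empty or any step failed,
    1.0 if the final output contains the expected answer, else 0.0."""
    if not trace or any(step.error is not None for step in trace):
        return 0.0
    final_output = trace[-1].output or ""
    return 1.0 if expected in final_output else 0.0


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Center and scale rewards within one group of rollouts for the same task.
    eps avoids division by zero when every rollout gets the same reward."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    good = [Step(code="search_emails('invoice')", output="msg-42")]
    bad = [Step(code="if x", error="SyntaxError")]
    rewards = [trace_reward(t, "msg-42") for t in (good, bad, bad, good)]
    print(rewards)
    print(group_relative_advantages(rewards))
```

In this sketch a failed rollout scores below the group mean and receives a negative advantage, while a successful one receives a positive advantage. When all rollouts in a group score the same, every advantage is zero and the group contributes no learning signal.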

Paper Structure

This paper contains 22 sections, 7 figures, 3 tables, 3 algorithms.

Figures (7)

  • Figure 1: The ToolBrain architecture, visualizing its two primary phases. The Data Generation Loop (solid lines) illustrates the agent's task execution, which is observed by the adapter to produce an execution trace. The subsequent Learning Loop (dashed lines) shows how this trace is scored and used by the RL module to update the agent's model.
  • Figure 2: The ToolBrain API workflow. This single code block demonstrates ToolBrain's key features, including its flexible reward system, intelligent tool retrieval, support for multiple learning algorithms, automated data generation, and built-in strategies like knowledge distillation.
  • Figure 3: Comparison of 10-iteration windowed mean accuracy across RL fine-tuning iterations for models trained with distillation (orange) and without distillation (blue).
  • Figure 4: A typical failure trace from an untrained agent. The agent attempts a complex logical structure but fails due to a SyntaxError, resulting in a reward of 0.0.
  • Figure 5: A snapshot of a single training step, highlighting the core operations: trace collection, reward computation, and loss calculation.
  • ...and 2 more figures