Klear-AgentForge: Forging Agentic Intelligence through Posttraining Scaling

Qi Wang, Hongzhi Zhang, Jia Fu, Kai Fu, Yahui Liu, Tinghai Zhang, Chenxi Sun, Gangwei Jiang, Jingyi Tang, Xingguang Ji, Yang Yue, Jingyuan Zhang, Fuzheng Zhang, Kun Gai, Guorui Zhou

TL;DR

Klear-AgentForge addresses the need for an open, scalable pipeline to train agentic LLMs that can interact with tools and code environments. It combines a two-stage training flow—supervised fine-tuning on synthetic and open data, followed by multi-turn reinforcement learning with a mixed reward scheme—and uses disaggregated, asynchronous training plus a model-merge strategy to create a unified agentic model. The 8B variant achieves state-of-the-art results among similarly sized models across tool-use and coding benchmarks, demonstrating the practical viability of post-training scaling in agentic LLMs. The work also analyzes scaling effects, data composition, and test-time strategies, highlighting the importance of robust verifiers and environment-grounded evaluation for future improvements.

Abstract

Despite the proliferation of powerful agentic models, the lack of critical post-training details hinders the development of strong counterparts in the open-source community. In this study, we present a comprehensive and fully open-source pipeline for training a high-performance agentic model for interacting with external tools and environments, named Klear-Qwen3-AgentForge, starting from the Qwen3-8B base model. We design effective supervised fine-tuning (SFT) with synthetic data followed by multi-turn reinforcement learning (RL) to unlock the potential for multiple diverse agentic tasks. We perform extensive experiments on various agentic benchmarks in both tool use and coding domains. Klear-Qwen3-AgentForge-8B achieves state-of-the-art performance among LLMs of similar size and remains competitive with significantly larger models.

Paper Structure

This paper contains 32 sections, 9 equations, 7 figures, 3 tables.

Figures (7)

  • Figure 1: Klear-AgentForge main results on multiple agentic tasks.
  • Figure 2: Our multi-turn prompting pipeline for synthesizing tool-use data.
  • Figure 3: Comparison of scaling effects between single-trajectory and multi-trajectory SFT data on SWE-bench Verified.
  • Figure 4: Training reward scores for the tool-use (left), SWE (middle), and code-contest (right) tasks.
  • Figure 5: Step-time breakdown on SWE-bench Verified. For the disaggregated framework, 4 nodes are used for training and the other 4 nodes for inference.
  • ...and 2 more figures