
AgentCPM-Explore: Realizing Long-Horizon Deep Exploration for Edge-Scale Agents

Haotian Chen, Xin Cong, Shengda Fan, Yuyang Fu, Ziqin Gong, Yaxi Lu, Yishan Li, Boye Niu, Chengjun Pan, Zijun Song, Huadong Wang, Yesai Wu, Yueying Wu, Zihao Xie, Yukun Yan, Zhong Zhang, Yankai Lin, Zhiyuan Liu, Maosong Sun

TL;DR

This work systematically investigates training agentic models at the 4B scale and identifies three key bottlenecks: catastrophic forgetting, sensitivity to reward noise, and long-context contamination. It introduces AgentCPM-Explore, a compact 4B agent model trained with a framework that combines parameter-space model merging, reward signal denoising, and context information refinement to enable long-horizon deep exploration in edge-scale agents. Empirically, the 4B agent achieves SOTA performance among 4B-class models and rivals larger models on multiple benchmarks, reaching 97.09% on GAIA text-based tasks under pass@64. The study shows that, with a carefully designed training framework, edge-scale models can deliver problem-solving capabilities previously attributed mainly to larger models, which matters in practice for privacy-preserving, low-resource intelligent agents.
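As a rough illustration of what parameter-space model merging can look like, the sketch below linearly interpolates two checkpoints that share the same parameter names. This is an assumption made for illustration only: the text above does not specify the merging rule used for AgentCPM-Explore, and the names `merge_weights`, `base`, `tuned`, and `alpha` are hypothetical.

```python
from typing import Dict, List

# A checkpoint is modeled here as a mapping from parameter name to a flat
# list of floats. Real checkpoints hold tensors; this is a simplification.
Weights = Dict[str, List[float]]


def merge_weights(base: Weights, tuned: Weights, alpha: float) -> Weights:
    """Return (1 - alpha) * base + alpha * tuned, parameter by parameter.

    Linear interpolation is one simple merging rule, used here as an
    assumption; it is not claimed to be the method used in the paper.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    if base.keys() != tuned.keys():
        raise ValueError("checkpoints must share the same parameter names")

    merged: Weights = {}
    for name, base_vals in base.items():
        tuned_vals = tuned[name]
        if len(base_vals) != len(tuned_vals):
            raise ValueError(f"size mismatch for parameter {name!r}")
        merged[name] = [
            (1.0 - alpha) * b + alpha * t
            for b, t in zip(base_vals, tuned_vals)
        ]
    return merged
```

With `alpha = 0` the result equals the base checkpoint and with `alpha = 1` it equals the fine-tuned one; intermediate values trade off retained general ability against newly learned agentic behavior, which is the intuition behind using merging to counter catastrophic forgetting.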

Abstract

While Large Language Model (LLM)-based agents have shown remarkable potential for solving complex tasks, existing systems remain heavily reliant on large-scale models, leaving the capabilities of edge-scale models largely underexplored. In this paper, we present the first systematic study on training agentic models at the 4B-parameter scale. We identify three primary bottlenecks hindering the performance of edge-scale models: catastrophic forgetting during Supervised Fine-Tuning (SFT), sensitivity to reward signal noise during Reinforcement Learning (RL), and reasoning degradation caused by redundant information in long-context scenarios. To address these issues, we propose AgentCPM-Explore, a compact 4B agent model with high knowledge density and strong exploration capability. We introduce a holistic training framework featuring parameter-space model fusion, reward signal denoising, and contextual information refinement. Through deep exploration, AgentCPM-Explore achieves state-of-the-art (SOTA) performance among 4B-class models, matches or surpasses 8B-class SOTA models on four benchmarks, and even outperforms larger-scale models such as Claude-4.5-Sonnet or DeepSeek-v3.2 on five benchmarks. Notably, AgentCPM-Explore achieves 97.09% accuracy on GAIA text-based tasks under pass@64. These results provide compelling evidence that the bottleneck for edge-scale models is not their inherent capability ceiling, but rather their inference stability. Built on this training framework, AgentCPM-Explore effectively unlocks the significant, yet previously underestimated, potential of edge-scale models.
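For readers unfamiliar with the pass@64 figure quoted above, the sketch below shows one common way to compute pass@k: the unbiased estimator 1 - C(n-c, k) / C(n, k), where n is the number of sampled attempts per task and c is the number of correct ones. The abstract does not state which estimator was used, so treat this as an assumption; the function names `pass_at_k` and `mean_pass_at_k` are hypothetical.

```python
from math import comb
from typing import List, Tuple


def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k attempts, drawn without
    replacement from n sampled attempts of which c are correct, succeeds."""
    if n < 1 or not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require n >= 1, 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k incorrect attempts exist,
    # so the estimate is exactly 1.0 in that case.
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results: List[Tuple[int, int]], k: int) -> float:
    """Average pass@k over tasks; each entry is (n_samples, n_correct)."""
    if not results:
        raise ValueError("results must not be empty")
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

When n equals k, as in sampling exactly 64 attempts for pass@64, the estimator reduces to a per-task indicator of whether any attempt succeeded. A high pass@64 alongside a lower single-attempt score therefore indicates that correct solutions are within the model's reach but not produced consistently, which matches the abstract's point about inference stability.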

Paper Structure (48 sections, 6 equations, 8 figures, 1 table)

This paper contains 48 sections, 6 equations, 8 figures, 1 table.

Figures (8)

  • Figure 1: Overall Training Framework of AgentCPM-Explore.
  • Figure 2: Overall performance of different summary models in the same agent system on the GAIA benchmark. Abbreviations are as follows. DS: DeepSeek; Qwen3-4B-I: Qwen3-4B-Instruct-2507; Qwen3-4B-T: Qwen3-4B-Thinking-2507; FT: Fine-tuned.
  • Figure 3: RL learning curves under different training settings.
  • Figure 4: Pass@K performance comparison on the GAIA benchmark. The figure illustrates the capability boundary of the 4B model by comparing the "SFT-Merge" baseline with the final AgentCPM-Explore (RL after merging). Qwen3-4B-Thinking-2507 serves as the base model.
  • Figure 5: Standardized agentic training and inference infrastructure.
  • ...and 3 more figures