AgentCPM-Explore: Realizing Long-Horizon Deep Exploration for Edge-Scale Agents

Haotian Chen; Xin Cong; Shengda Fan; Yuyang Fu; Ziqin Gong; Yaxi Lu; Yishan Li; Boye Niu; Chengjun Pan; Zijun Song; Huadong Wang; Yesai Wu; Yueying Wu; Zihao Xie; Yukun Yan; Zhong Zhang; Yankai Lin; Zhiyuan Liu; Maosong Sun

TL;DR

This work systematically investigates training agentic models at the 4B scale, identifying catastrophic forgetting, reward-noise sensitivity, and long-context contamination as key bottlenecks. It introduces AgentCPM-Explore, a three-stage framework combining parameter-space model merging, reward signal denoising, and context information refinement to enable long-horizon deep exploration in edge-scale agents. Empirical results show 4B agents achieving SOTA performance among peers and rivaling larger models on multiple benchmarks, with GAIA pass@64 reaching 97.09% under extended inference. The study demonstrates that with a carefully designed training framework, edge-scale models can realize substantial problem-solving capabilities previously attributed mainly to larger models, offering practical impact for privacy-preserving, low-resource intelligent agents.

Abstract

While Large Language Model (LLM)-based agents have shown remarkable potential for solving complex tasks, existing systems remain heavily reliant on large-scale models, leaving the capabilities of edge-scale models largely underexplored. In this paper, we present the first systematic study on training agentic models at the 4B-parameter scale. We identify three primary bottlenecks hindering the performance of edge-scale models: catastrophic forgetting during Supervised Fine-Tuning (SFT), sensitivity to reward signal noise during Reinforcement Learning (RL), and reasoning degradation caused by redundant information in long-context scenarios. To address these issues, we propose AgentCPM-Explore, a compact 4B agent model with high knowledge density and strong exploration capability. We introduce a holistic training framework featuring parameter-space model fusion, reward signal denoising, and contextual information refinement. Through deep exploration, AgentCPM-Explore achieves state-of-the-art (SOTA) performance among 4B-class models, matches or surpasses 8B-class SOTA models on four benchmarks, and even outperforms larger-scale models such as Claude-4.5-Sonnet or DeepSeek-v3.2 on five benchmarks. Notably, AgentCPM-Explore achieves 97.09% accuracy on GAIA text-based tasks under pass@64. These results provide compelling evidence that the bottleneck for edge-scale models is not their inherent capability ceiling, but rather their inference stability. With this training framework, AgentCPM-Explore unlocks the significant, yet previously underestimated, potential of edge-scale models.
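
The pass@64 figure quoted above is a pass@K metric: a task counts as solved if at least one of K sampled attempts succeeds. The abstract does not say how the metric was computed, so the following is only a minimal sketch that assumes the widely used unbiased estimator over n samples with c correct ones. The function names `pass_at_k` and `benchmark_pass_at_k` are hypothetical and do not come from the paper.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one task from n sampled attempts, c of them correct.

    Assumption: the standard unbiased estimator 1 - C(n-c, k) / C(n, k),
    not necessarily the exact procedure used in the paper.
    """
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Every size-k subset of the attempts contains a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average per-task pass@k over (n, c) pairs, one pair per task."""
    if not results:
        raise ValueError("results must not be empty")
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

When n equals K (for example 64 attempts and K = 64), the estimator reduces to checking whether any attempt succeeded, which is the plain reading of pass@64.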

Paper Structure (48 sections, 6 equations, 8 figures, 1 table)

Introduction
Methodology
Problem Formulation
Parameter-Space Model Merging
Reward Signal Denoising Mechanism
Environmental Noise Filtering
Format Error Filtering
Extreme Trajectory Filtering
Context Information Refinement
Experiment
Experimental Settings
Main Results
Analysis of Context Information Refinement
Analysis of Reward Denoising Mechanism
Visualize the Capability Boundary of Edge-Scale Models
...and 33 more sections

Figures (8)

Figure 1: Overall Training Framework of AgentCPM-Explore.
Figure 2: Overall performance of different summary models in the same agent system on the GAIA benchmark. Abbreviations are as follows. DS: DeepSeek; Qwen3-4B-I: Qwen3-4B-Instruct-2507; Qwen3-4B-T: Qwen3-4B-Thinking-2507; FT: Fine-tuned.
Figure 3: RL learning curves under different training settings.
Figure 4: Pass@K performance comparison on the GAIA benchmark. The figure illustrates the capability boundary of the 4B model by comparing the "SFT-Merge" baseline with the final AgentCPM-Explore (RL after merging). Qwen3-4B-Thinking-2507 serves as the base model.
Figure 5: Standardized agentic training and inference infrastructure.
...and 3 more figures
